## **Import Libraries**

In [1]:
#!pip install selenium
#!pip install webdriver-manager
#!pip install pyyaml ua-parser user-agents fake-useragent

################ WEB SCRAPING MODULES ############
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.utils import ChromeType
from selenium.webdriver.common.by import By
import bs4
from fake_useragent import UserAgent
import requests
################ TIME MODLULES ###################
import time
from datetime import date 
import datetime
############## DATA MANIPULATION MODULES #########
import os
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords

## **Define web source**

In [5]:
link = 'https://www.myautismteam.com/resources/autism-an-overview'

## **Read 100++ questions list**

In [6]:
df = pd.read_excel('ASDquestions1.xlsx',engine='openpyxl')

df[link]=np.nan
df

Unnamed: 0,Question,https://birchtreecenter.org/learn/autism,https://www.myautismteam.com/resources/autism-an-overview
0,What are the Autism Spectrum Disorders (ASD)?,ASD refers to a wide spectrum of neurodevelopm...,
1,How common is autism?,According to a 2020 report commissioned by the...,
2,What causes autism? Can it be cured?,The causes of this complex disorder remain unc...,
3,What is Autism?,,
4,What is Asperger’s Syndrome?,,
...,...,...,...
97,What are some ways that parents can reduce the...,,
98,Do some families deal with stress better than ...,,
99,Do siblings suffer increased stress as a resul...,,
100,What can I do about my children’s stress?,,


## **Scrape QA pairs from website**

In [7]:
# Open webpage in a new window for scraping
#driver = webdriver.Chrome(ChromeDriverManager(chrome_type=ChromeType.GOOGLE).install())   #cannot fix in colab

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(link)



Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
Driver [/home/aceirus/.wdm/drivers/chromedriver/linux64/92.0.4515.107/chromedriver] found in cache


In [8]:
# Parse text in webpage
source = driver.page_source
soup = bs4.BeautifulSoup(source, 'html.parser')

In [9]:
# Search the questions mentioned in webpage
quesList = []
for ques in soup.find_all('b'):
    print(ques.text)
    quesList.append(ques.text)

The History of Autism
How Common Is Autism?
How Is Autism Diagnosed?
How Is Autism Treated?
What Causes Autism?
Resources
External resource
MyAutismTeam resources
FAQs
Is there a cure for autism?
Is autism permanent?
Is autism contagious?
Are rates of autism increasing?
Is autism a new condition?
Become a Member
Become a Subscriber
Thank you for signing up.


In [47]:
# Search the answers for questions
ansList = []
#for ans in soup.find_all('br'):
#for ans in soup.find_all('span',{'class':'s1'}):
for ans in soup.find_all('span',{'style':'font-size:16px;'}):

    print(ans.text)
    print('*'*100)
    ansList.append(ans.text)

Autism spectrum disorder (ASD) encompasses a wide range of neurological and developmental disorders. Autism can cause delays and disabilities related to learning, communication, social and motor skills, and behavior. Many people with autism also live with sensory sensitivities, seizures, gastrointestinal problems, anxiety disorders, and attention deficit and hyperactivity disorder (ADHD).Autism is referred to as a spectrum because its effects can range from mild to very severe and debilitating. Some autistic people function independently for their age level, while others need assistance with basic functions for their entire lives. Approximately one-third of people with autism never begin talking, although many learn nonverbal forms of communication. Approximately one-third of autistic people have intellectual disabilities, while others have average or significantly above-average intellectual abilities. Despite these differences, most people with ASD share difficulties in communicating,

In [48]:
# Search the answers for questions
ansList2 = []
#for ans in soup.find_all('br'):
for ans in soup.find_all('span',{'class':'s1'}):
#for ans in soup.find_all('span',{'style':'font-size:16px;'}):

    print(ans.text)
    print('*'*100)
    ansList2.append(ans.text)

Autism spectrum disorder (ASD) encompasses a wide range of neurological and developmental disorders. Autism can cause delays and disabilities related to learning, communication, social and motor skills, and behavior. Many people with autism also live with sensory sensitivities, seizures, gastrointestinal problems, anxiety disorders, and attention deficit and hyperactivity disorder (ADHD).
****************************************************************************************************
Autism is referred to as a spectrum because its effects can range from mild to very severe and debilitating. Some autistic people function independently for their age level, while others need assistance with basic functions for their entire lives. Approximately one-third of people with autism never begin talking, although many learn nonverbal forms of communication. Approximately one-third of autistic people have intellectual disabilities, while others have average or significantly above-average intell

## **Check which questions are similar**

In [29]:
stop_words = set(stopwords.words('english'))

def clean_text(sent):
    sent = sent.lower() # lowercase
    sent = re.sub(r'[^\w\s]', '', sent) # remove punctuations
    sent = re.sub('Autism Spectrum Disorder','ASD',sent) # Compress term
    sent = [w for w in sent.split() if not w.lower() in stop_words] # Remove stopwords
    sent = " ".join(sent)
    return sent

In [30]:
def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))

In [69]:
# try to match websource questions with our own 100 questions list
for c1,i in enumerate(quesList):
    i2 = clean_text(i)
    list1 = i2.split()
    
    temp1 = 0.0
    temp2 = ''
    temp3 = ''
    temp4 = 0
    
    for c2,j in enumerate(df['Question']):
        j2 = clean_text(j)
        list2 = j2.split()
            
        sim = jaccard_similarity(list1, list2)
        if(sim>temp1):
            temp1 = sim
            temp2 = j
            temp3 = i
            temp4 = c2
            
    if(temp1>=0.3):  # sim threshold
        print('Website --> ',temp3,'(Index {})'.format(c1))
        print('100 questions list --> ',temp2,'(Index {})'.format(temp4))
        print('similarity:', temp1)
        print('*'*100)

['autism', 'spectrum', 'disorders', 'asd']
['common', 'autism']
['causes', 'autism', 'cured']
['autism']
['aspergers', 'syndrome']
['tell', 'autism', 'aspergers', 'syndrome']
['pervasive', 'developmental', 'disorder', 'otherwise', 'specified', 'pddnos']
['rett', 'syndrome']
['childhood', 'disintegrative', 'disorder']
['prognosis', 'children', 'autism']
['diseases', 'symptoms', 'autism']
['autistic', 'children', 'commonly', 'suffer', 'illnesses']
['association', 'autism', 'tourettes', 'syndrome']
['heard', 'symptoms', 'autism', 'exaggerated', 'incorrectly', 'portrayed', 'media', 'common', 'myths', 'autism']
['risk', 'factors', 'autism']
['symptoms', 'autism', 'parent', 'look']
['doctors', 'diagnose', 'children', 'autism']
['screening', 'tools', 'autism']
['common', 'autism', 'screening', 'tools', 'used', 'today']
['medical', 'tests', 'doctor', 'perform', 'making', 'diagnosis', 'autism']
['multidisciplinary', 'team', 'help', 'diagnose', 'autistic', 'child']
['tests', 'recommended', 'diag

In [32]:
for i in range (0,len(quesList)):
    print("Index #{0:d}: {1:s}".format(i,quesList[i]))

Index #0: The History of Autism
Index #1: How Common Is Autism?
Index #2: How Is Autism Diagnosed?
Index #3: How Is Autism Treated?
Index #4: What Causes Autism?
Index #5: Resources
Index #6: External resource
Index #7: MyAutismTeam resources
Index #8: FAQs
Index #9: Is there a cure for autism?
Index #10: Is autism permanent?
Index #11: Is autism contagious?
Index #12: Are rates of autism increasing?
Index #13: Is autism a new condition?
Index #14: Become a Member
Index #15: Become a Subscriber
Index #16: Thank you for signing up.


In [33]:
for i in range (0,len(ansList)):
    print("Index #{0:d}: {1:s}".format(i,ansList[i]))

Index #0: Autism spectrum disorder (ASD) encompasses a wide range of neurological and developmental disorders. Autism can cause delays and disabilities related to learning, communication, social and motor skills, and behavior. Many people with autism also live with sensory sensitivities, seizures, gastrointestinal problems, anxiety disorders, and attention deficit and hyperactivity disorder (ADHD).Autism is referred to as a spectrum because its effects can range from mild to very severe and debilitating. Some autistic people function independently for their age level, while others need assistance with basic functions for their entire lives. Approximately one-third of people with autism never begin talking, although many learn nonverbal forms of communication. Approximately one-third of autistic people have intellectual disabilities, while others have average or significantly above-average intellectual abilities. Despite these differences, most people with ASD share difficulties in comm

In [49]:
for j in range (0,len(ansList2)):
    print("Index #{0:d}: {1:s}".format(j,ansList2[j]))

Index #0: Autism spectrum disorder (ASD) encompasses a wide range of neurological and developmental disorders. Autism can cause delays and disabilities related to learning, communication, social and motor skills, and behavior. Many people with autism also live with sensory sensitivities, seizures, gastrointestinal problems, anxiety disorders, and attention deficit and hyperactivity disorder (ADHD).
Index #1: Autism is referred to as a spectrum because its effects can range from mild to very severe and debilitating. Some autistic people function independently for their age level, while others need assistance with basic functions for their entire lives. Approximately one-third of people with autism never begin talking, although many learn nonverbal forms of communication. Approximately one-third of autistic people have intellectual disabilities, while others have average or significantly above-average intellectual abilities. Despite these differences, most people with ASD share difficult

In [34]:
# add websource answer to matched question in existing dataframe
df[link].loc[1]=ansList[2]
df[link].loc[2]=ansList[5]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [42]:
df[link].loc[1]

'It is estimated that in the United States 1.6 percent of children, or one in 59, have autism – this estimate includes one in 37 boys and one in 151 girls. Autism is more difficult to diagnose in adults, making it harder to estimate how many adults live with autism. Worldwide, between 1 and 2 percent of people are believed to be on the autism spectrum. Boys are about four times more likely to be diagnosed with autism as girls. New studies suggest that symptoms of autism may differ in girls, making them less likely to be diagnosed. All races, ethnicities, and socioeconomic classes are equally affected by autism.'

In [45]:
# update with new valid questions list
quesListUpd = quesList[2:4]+quesList[9:14]

quesListUpd

['How Is Autism Diagnosed?',
 'How Is Autism Treated?',
 'Is there a cure for autism?',
 'Is autism permanent?',
 'Is autism contagious?',
 'Are rates of autism increasing?',
 'Is autism a new condition?']

In [50]:
# manually select answers to updated questions list
ansListUpd = [ansList[3],
              ansList[4],
              ansList2[57],
              ansList2[59],
              ansList2[61],
              ansList2[63],
              ansList2[65]]
ansListUpd

['There is no one single conclusive test for autism. There are differences between screenings at well-child visits, school evaluations, and medical diagnoses. An autism screening does not provide a diagnosis but checks for red flags that might indicate a need for further evaluation. A school evaluation or educational determination is made by a team of education professionals and focuses on rating the level of disability. Medical diagnosis is performed by a doctor such as a developmental pediatrician, child psychologist, child psychiatrist, or neuropsychologist. For a medical diagnosis, the doctor will ask questions about the child’s developmental history, assess symptoms, and test cognitive abilities, language skills, and age-appropriate physical skills.While some children may show signs of autism within the first few months of life, most are diagnosed between ages 3 and 6. Older children and adolescents may be recommended for evaluation and diagnosis by teachers. Diagnosing autism in 

In [51]:
# Create new dataframe with QA pairs
df2 = pd.DataFrame(zip(quesListUpd,ansListUpd),columns=['Question',link])
df2

Unnamed: 0,Question,https://www.myautismteam.com/resources/autism-an-overview
0,How Is Autism Diagnosed?,There is no one single conclusive test for aut...
1,How Is Autism Treated?,Treatment for autism depends largely on the in...
2,Is there a cure for autism?,"There is no cure for autism. However, early in..."
3,Is autism permanent?,There is some controversy on the topic of whet...
4,Is autism contagious?,Autism is not a contagious condition. Autism i...
5,Are rates of autism increasing?,Estimates released by the Centers for Disease ...
6,Is autism a new condition?,"It is likely that autism has always existed, b..."


In [55]:
# Concatenate existing and new dataframes
df3 = pd.concat([df,df2],axis=0)
df3 = df3.sort_values(by=list(df3.columns[1:])).reset_index(drop=True)
df3.head(15)

Unnamed: 0,Question,https://birchtreecenter.org/learn/autism,https://www.myautismteam.com/resources/autism-an-overview
0,What are the Autism Spectrum Disorders (ASD)?,ASD refers to a wide spectrum of neurodevelopm...,
1,How common is autism?,According to a 2020 report commissioned by the...,It is estimated that in the United States 1.6 ...
2,What causes autism? Can it be cured?,The causes of this complex disorder remain unc...,
3,Is autism contagious?,,Autism is not a contagious condition. Autism i...
4,Are rates of autism increasing?,,Estimates released by the Centers for Disease ...
5,Is autism a new condition?,,"It is likely that autism has always existed, b..."
6,Is there a cure for autism?,,"There is no cure for autism. However, early in..."
7,How Is Autism Diagnosed?,,There is no one single conclusive test for aut...
8,Is autism permanent?,,There is some controversy on the topic of whet...
9,How Is Autism Treated?,,Treatment for autism depends largely on the in...


## **Save Output**

In [56]:
df3.to_excel('ASDquestions2.xlsx',index=False)