# Scraping The Wikipedia Page

## Goals
Two Wikipedia category pages each contain a list of links. I need to:
- scrape the two category pages
- collect all the links and, for each link, scrape the page text
- put the data (with text) into a dataframe
- set up a classification problem:
    - clean the text
    - set the target
    - build models to predict the target
    - evaluate the results (give the top ten words)

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

In [13]:
url = "https://en.wikipedia.org/wiki/Category:Dislocations,_sprains_and_strains"

In [14]:
#url = "https://en.wikipedia.org/wiki/Category:Sports_injuries"

## Initial Scrape for links and titles 

In [None]:
#I used Beautiful Soup to scrape the page

In [15]:
page = requests.get(url)

In [16]:
print(page.status_code)

200


In [17]:
print(page.content)

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Category:Dislocations, sprains and strains - Wikipedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":14,"wgPageName":"Category:Dislocations,_sprains_and_strains","wgTitle":"Dislocations, sprains and strains","wgCurRevisionId":833670789,"wgRevisionId":833670789,"wgArticleId":32571980,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Commons category link is on Wikidata","Musculoskeletal disorders","Injuries"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","M

In [18]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
#this made it easier to find what I need

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Category:Dislocations, sprains and strains - Wikipedia
  </title>
  <script>
   document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":14,"wgPageName":"Category:Dislocations,_sprains_and_strains","wgTitle":"Dislocations, sprains and strains","wgCurRevisionId":833670789,"wgRevisionId":833670789,"wgArticleId":32571980,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Commons category link is on Wikidata","Musculoskeletal disorders","Injuries"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","

In [19]:
finder_section = soup.find('div', class_='mw-category')

finder_all = finder_section.find_all('li')
print(finder_all)
#sectioning out the category names and the lists

[<li><a href="/wiki/Achilles_tendon_rupture" title="Achilles tendon rupture">Achilles tendon rupture</a></li>, <li><a href="/wiki/ALPSA_lesion" title="ALPSA lesion">ALPSA lesion</a></li>, <li><a href="/wiki/Anterior_cruciate_ligament_injury" title="Anterior cruciate ligament injury">Anterior cruciate ligament injury</a></li>, <li><a href="/wiki/Bankart_lesion" title="Bankart lesion">Bankart lesion</a></li>, <li><a href="/wiki/Biceps_femoris_tendon_rupture" title="Biceps femoris tendon rupture">Biceps femoris tendon rupture</a></li>, <li><a href="/wiki/Biceps_tendon_rupture" title="Biceps tendon rupture">Biceps tendon rupture</a></li>, <li><a href="/wiki/Cuboid_syndrome" title="Cuboid syndrome">Cuboid syndrome</a></li>, <li><a href="/wiki/Dislocated_shoulder" title="Dislocated shoulder">Dislocated shoulder</a></li>, <li><a href="/wiki/Dislocation_of_jaw" title="Dislocation of jaw">Dislocation of jaw</a></li>, <li><a href="/wiki/High_ankle_sprain" title="High ankle sprain">High ankle spr

In [20]:
finder = []
for li in finder_all:
    a_tag = li.find('a', href=True, attrs={'title':True, 'class':False}) # find a tags that have an href and a title but no class
    href = a_tag['href'] # get the href attribute
    text = a_tag.getText() # get the text
    finder.append([href, text]) # append to list
print(finder)

[['/wiki/Achilles_tendon_rupture', 'Achilles tendon rupture'], ['/wiki/ALPSA_lesion', 'ALPSA lesion'], ['/wiki/Anterior_cruciate_ligament_injury', 'Anterior cruciate ligament injury'], ['/wiki/Bankart_lesion', 'Bankart lesion'], ['/wiki/Biceps_femoris_tendon_rupture', 'Biceps femoris tendon rupture'], ['/wiki/Biceps_tendon_rupture', 'Biceps tendon rupture'], ['/wiki/Cuboid_syndrome', 'Cuboid syndrome'], ['/wiki/Dislocated_shoulder', 'Dislocated shoulder'], ['/wiki/Dislocation_of_jaw', 'Dislocation of jaw'], ['/wiki/High_ankle_sprain', 'High ankle sprain'], ['/wiki/Hip_dislocation', 'Hip dislocation'], ['/wiki/Joint_dislocation', 'Joint dislocation'], ['/wiki/Knee_dislocation', 'Knee dislocation'], ['/wiki/Metatarsophalangeal_joint_sprain', 'Metatarsophalangeal joint sprain'], ['/wiki/Patellar_dislocation', 'Patellar dislocation'], ['/wiki/Patellar_tendon_rupture', 'Patellar tendon rupture'], ['/wiki/Perthes_lesion', 'Perthes lesion'], ['/wiki/Pulled_elbow', 'Pulled elbow'], ['/wiki/Pul
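The loop above assumes every `<li>` holds a matching `<a>` tag. When it does not, `li.find` returns `None` and `a_tag['href']` raises a `TypeError`. Below is a minimal sketch of a more defensive version, run on a small hand-written HTML snippet. The snippet and the helper name `extract_links` are illustrative assumptions, not part of the original notebook, and the `class` filter is dropped for brevity.

```python
from bs4 import BeautifulSoup

# Hand-written sample standing in for the category page (illustrative only).
SAMPLE_HTML = """
<div class="mw-category">
  <ul>
    <li><a href="/wiki/Achilles_tendon_rupture" title="Achilles tendon rupture">Achilles tendon rupture</a></li>
    <li>No link in this item</li>
    <li><a href="/wiki/ALPSA_lesion" title="ALPSA lesion">ALPSA lesion</a></li>
  </ul>
</div>
"""

def extract_links(html):
    """Return [href, text] pairs for every list item that holds a titled link."""
    soup = BeautifulSoup(html, 'html.parser')
    section = soup.find('div', class_='mw-category')
    if section is None:
        # The page does not have the expected category container.
        return []
    links = []
    for li in section.find_all('li'):
        a_tag = li.find('a', href=True, title=True)
        if a_tag is None:
            # Skip list items without a usable link instead of crashing.
            continue
        links.append([a_tag['href'], a_tag.get_text()])
    return links

links = extract_links(SAMPLE_HTML)
print(links)
```

Returning an empty list when the container is missing keeps the caller simple, at the cost of hiding layout changes. Raising an error there would be a reasonable alternative.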

In [21]:
#made a csv to save the links to; csv.writer quotes any title that contains a comma
import csv
with open('DSS.csv', 'w', newline='') as f:
    csv.writer(f).writerows(finder)
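A title that contains a comma only survives a CSV round trip if the field is quoted. The sketch below compares plain comma-joining with `csv.writer` on an in-memory buffer. The second row is a hypothetical title made up to trigger the problem.

```python
import csv
import io

rows = [
    ['/wiki/Achilles_tendon_rupture', 'Achilles tendon rupture'],
    ['/wiki/Example_page', 'Example, with a comma'],  # hypothetical title
]

# Naive approach: join the fields with commas, then parse the lines back.
naive_text = ''.join(','.join(row) + '\n' for row in rows)
naive_back = list(csv.reader(io.StringIO(naive_text)))

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)
safe_back = list(csv.reader(io.StringIO(buffer.getvalue(), newline='')))

print(len(naive_back[1]))   # number of fields recovered from the naive row
print(safe_back == rows)    # whether the quoted version round-trips
```

With the naive version, a later `for i, j in DSS_list` loop would fail to unpack the row that gained an extra field.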

In [9]:
#with open('SI.csv', 'w') as f:
#    for i in finder:
#        f.write(",".join(i)+"\n")

## Make into list/dataframe

In [3]:
import csv

with open('DSS.csv', 'r') as f:
  reader = csv.reader(f)
  DSS_list = list(reader)

In [4]:
print(DSS_list)

[['/wiki/Achilles_tendon_rupture', 'Achilles tendon rupture'], ['/wiki/ALPSA_lesion', 'ALPSA lesion'], ['/wiki/Anterior_cruciate_ligament_injury', 'Anterior cruciate ligament injury'], ['/wiki/Bankart_lesion', 'Bankart lesion'], ['/wiki/Biceps_femoris_tendon_rupture', 'Biceps femoris tendon rupture'], ['/wiki/Biceps_tendon_rupture', 'Biceps tendon rupture'], ['/wiki/Cuboid_syndrome', 'Cuboid syndrome'], ['/wiki/Dislocated_shoulder', 'Dislocated shoulder'], ['/wiki/Dislocation_of_jaw', 'Dislocation of jaw'], ['/wiki/High_ankle_sprain', 'High ankle sprain'], ['/wiki/Hip_dislocation', 'Hip dislocation'], ['/wiki/Joint_dislocation', 'Joint dislocation'], ['/wiki/Knee_dislocation', 'Knee dislocation'], ['/wiki/Metatarsophalangeal_joint_sprain', 'Metatarsophalangeal joint sprain'], ['/wiki/Patellar_dislocation', 'Patellar dislocation'], ['/wiki/Patellar_tendon_rupture', 'Patellar tendon rupture'], ['/wiki/Perthes_lesion', 'Perthes lesion'], ['/wiki/Pulled_elbow', 'Pulled elbow'], ['/wiki/Pul

In [5]:
with open('SI.csv', 'r') as f:
  reader = csv.reader(f)
  SI_list = list(reader)

In [6]:
print(SI_list)

[['/wiki/Sports_injury', 'Sports injury'], ['/wiki/Achilles_tendon_rupture', 'Achilles tendon rupture'], ['/wiki/Biceps_tendon_rupture', 'Biceps tendon rupture'], ['/wiki/Boxer%27s_fracture', "Boxer's fracture"], ['/wiki/Cauliflower_ear', 'Cauliflower ear'], ['/wiki/Chronic_traumatic_encephalopathy', 'Chronic traumatic encephalopathy'], ['/wiki/Concussions_in_Australian_sport', 'Concussions in Australian sport'], ['/wiki/Concussions_in_high_school_sports', 'Concussions in high school sports'], ['/wiki/Concussions_in_rugby_union', 'Concussions in rugby union'], ['/wiki/Injured_list', 'Injured list'], ['/wiki/Footballer%27s_ankle', "Footballer's ankle"], ['/wiki/Golfer%27s_elbow', "Golfer's elbow"], ['/wiki/Health_issues_in_American_football', 'Health issues in American football'], ['/wiki/Helmet_removal_(sports)', 'Helmet removal (sports)'], ['/wiki/Helmet-to-helmet_collision', 'Helmet-to-helmet collision'], ['/wiki/Injured_reserve_list', 'Injured reserve list'], ['/wiki/Knee_dislocatio

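Rough plan for the next step: loop over the links and fetch each page.

for href, title in SI_list:
    print(href)
    # the hrefs are relative, so prepend the Wikipedia domain before requesting
    page = requests.get('https://en.wikipedia.org' + href)
    print(page.content)
    soup = BeautifulSoup(page.content, 'html.parser')
    print(soup.prettify())

Note: https://en.wikipedia.org/ needs to be added to every link.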

finder_section = soup.find('div', class_='mw-category')

finder_all = finder_section.find_all('li')
print(finder_all)

finder = []
for li in finder_all:
    a_tag = li.find('a', href=True, attrs={'title':True, 'class':False}) # find a tags that have an href and a title but no class
    href = a_tag['href'] # get the href attribute
    text = a_tag.getText() # get the text
    finder.append([href, text]) # append to array
print(finder)

In [15]:
#def gettext():
#    for i in SI_list:
#        page = requests.get(i)
#        print(page.content)
#gettext()

In [31]:
len(DSS_list)

31

In [47]:
len(SI_list)

25

In [7]:
SI_listWiki=[]

for i, j in SI_list:
    SI_listWiki.append(['https://en.wikipedia.org/' + i, j, 'SI'])
    
#SI_listWiki

In [None]:
#prepended the Wikipedia domain to each link and added the target label ('SI' or 'DSS')

In [8]:
DSS_listWiki=[]

for i, j in DSS_list:
    DSS_listWiki.append(['https://en.wikipedia.org/' + i, j, 'DSS'])
    
#DSS_listWiki
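Each scraped href already starts with `/`, so concatenating it onto a base that ends in `/` gives URLs containing `//wiki/`. Those requests still returned 200 in this notebook, but a cleaner join is easy with the standard library. This is a minimal sketch; `BASE_URL`, `to_absolute` and the two-row sample are names and data introduced here for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en.wikipedia.org/'

def to_absolute(href):
    """Resolve a site-relative href against the Wikipedia base URL."""
    return urljoin(BASE_URL, href)

# Two [href, title] pairs in the same shape as DSS_list / SI_list.
sample_pairs = [
    ['/wiki/Achilles_tendon_rupture', 'Achilles tendon rupture'],
    ['/wiki/Boxer%27s_fracture', "Boxer's fracture"],
]

# Build [url, title, label] rows, as the loops above do.
labelled = [[to_absolute(href), title, 'SI'] for href, title in sample_pairs]
print(labelled[0][0])
```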

In [9]:
listdata = DSS_listWiki + SI_listWiki
listdata
#put both of these in the same list

[['https://en.wikipedia.org//wiki/Achilles_tendon_rupture',
  'Achilles tendon rupture',
  'DSS'],
 ['https://en.wikipedia.org//wiki/ALPSA_lesion', 'ALPSA lesion', 'DSS'],
 ['https://en.wikipedia.org//wiki/Anterior_cruciate_ligament_injury',
  'Anterior cruciate ligament injury',
  'DSS'],
 ['https://en.wikipedia.org//wiki/Bankart_lesion', 'Bankart lesion', 'DSS'],
 ['https://en.wikipedia.org//wiki/Biceps_femoris_tendon_rupture',
  'Biceps femoris tendon rupture',
  'DSS'],
 ['https://en.wikipedia.org//wiki/Biceps_tendon_rupture',
  'Biceps tendon rupture',
  'DSS'],
 ['https://en.wikipedia.org//wiki/Cuboid_syndrome', 'Cuboid syndrome', 'DSS'],
 ['https://en.wikipedia.org//wiki/Dislocated_shoulder',
  'Dislocated shoulder',
  'DSS'],
 ['https://en.wikipedia.org//wiki/Dislocation_of_jaw',
  'Dislocation of jaw',
  'DSS'],
 ['https://en.wikipedia.org//wiki/High_ankle_sprain',
  'High ankle sprain',
  'DSS'],
 ['https://en.wikipedia.org//wiki/Hip_dislocation', 'Hip dislocation', 'DSS'],
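Some pages appear in both categories (Achilles tendon rupture and Biceps tendon rupture show up in both lists above), so the combined list holds the same URL under two different labels. That matters once `Type` becomes the classification target. The sketch below is one hedged way to spot such rows. The helper names and the shortened sample are assumptions for illustration.

```python
from collections import defaultdict

# A few rows in the same [url, title, label] shape as listdata (shortened sample).
sample = [
    ['https://en.wikipedia.org//wiki/Achilles_tendon_rupture', 'Achilles tendon rupture', 'DSS'],
    ['https://en.wikipedia.org//wiki/ALPSA_lesion', 'ALPSA lesion', 'DSS'],
    ['https://en.wikipedia.org//wiki/Sports_injury', 'Sports injury', 'SI'],
    ['https://en.wikipedia.org//wiki/Achilles_tendon_rupture', 'Achilles tendon rupture', 'SI'],
]

def labels_by_url(rows):
    """Map each URL to the set of labels it was given."""
    labels = defaultdict(set)
    for url, _title, label in rows:
        labels[url].add(label)
    return labels

def conflicting_urls(rows):
    """Return the URLs that carry more than one label, sorted for stable output."""
    return sorted(url for url, labels in labels_by_url(rows).items() if len(labels) > 1)

print(conflicting_urls(sample))
```

How to handle the overlap (drop the page, keep one label, or treat it as its own class) is a modelling decision that this sketch leaves open.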
 

## Trying to break up the list into a dataframe

In [10]:
def get_text(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    finder_section = soup.find('div', class_='mw-content-ltr')
    finder_all = finder_section.find_all('p')
    
    #texts = ' '.join(finder_all)
    #texts = ' '.join(item.text for item in finder_all)
    
    
    texts = ''
    
    for item in finder_all:
        texts += ' ' + item.text
        
    return texts

In [11]:
def get_data(url, title, target):
    return [title, url, target, get_text(url)]

In [14]:
finalList = []

for item in listdata:
    url = item[0]
    title = item[1]
    target = item[2]
    
    finalList.append(get_data(url, title, target))
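The loop above issues one request per link with no pause and no error handling, so a single failed page stops the whole run. Below is a minimal sketch of a more forgiving loop. `collect_rows`, `fetch_text` and `delay_seconds` are hypothetical names introduced here; in the notebook, `fetch_text` would be `get_text`. The demo uses a fake fetcher so it runs without network access.

```python
import time

def collect_rows(listdata, fetch_text, delay_seconds=1.0):
    """Build [title, url, target, text] rows, recording failures instead of stopping."""
    rows = []
    failures = []
    for url, title, target in listdata:
        try:
            text = fetch_text(url)
        except Exception as exc:
            # Remember what went wrong and move on to the next page.
            failures.append((url, repr(exc)))
            continue
        rows.append([title, url, target, text])
        time.sleep(delay_seconds)  # pause between successful requests
    return rows, failures

# Demo with a fake fetcher that fails for one URL (no network access needed).
def fake_fetch(url):
    if url.endswith('broken'):
        raise ValueError('simulated failure')
    return 'text for ' + url

demo = [
    ['https://example.org/a', 'Page A', 'DSS'],
    ['https://example.org/broken', 'Broken page', 'SI'],
]
rows, failures = collect_rows(demo, fake_fetch, delay_seconds=0)
print(rows)
print(failures)
```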

In [15]:
print(finalList)

[['Achilles tendon rupture', 'https://en.wikipedia.org//wiki/Achilles_tendon_rupture', 'DSS', " Achilles tendon rupture is when the Achilles tendon, at the back of the ankle, breaks.[5] Symptoms include the sudden onset of sharp pain in the heel.[3] A snapping sound may be heard as the tendon breaks and walking becomes difficult.[4]\n Rupture typically occurs as a result of a sudden bending up of the foot when the calf muscle is engaged, direct trauma, or long-standing tendonitis.[4][5] Other risk factors include the use of fluoroquinolones, a significant change in exercise, rheumatoid arthritis, gout, or corticosteroid use.[1][5] Diagnosis is typically based on symptoms and examination and supported by medical imaging.[5]\n Prevention may include stretching before activity.[4] Treatment may be by surgery or casting with the toes somewhat pointed down.[6][5][2] Relatively rapid return to weight bearing (within 4 weeks) appears okay.[6][7] The risk of re-rupture is about 25% with castin
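The scraped text still carries numeric citation markers such as `[5]` and stray newlines, which is part of what the "clean the text" goal has to deal with. Here is a minimal sketch of that step, assuming only numeric markers need removing. `clean_text` is a hypothetical helper, and markers like `[citation needed]` are deliberately left untouched.

```python
import re

# Matches numeric reference markers such as [5] or [12].
CITATION = re.compile(r'\[\d+\]')

def clean_text(text):
    """Drop numeric citation markers and collapse runs of whitespace."""
    without_refs = CITATION.sub('', text)
    return ' '.join(without_refs.split())

sample = (" Achilles tendon rupture is when the Achilles tendon, at the back of "
          "the ankle, breaks.[5] Symptoms include the sudden onset of sharp pain "
          "in the heel.[3]\n Prevention may include stretching before activity.[4]")
print(clean_text(sample))
```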

In [None]:
#listdata

#finalList = [get_data(url, title, target)
             #for (url, title, target) in lst_urls_etc]

In [None]:
#below is a test

In [9]:
get_text('https://en.wikipedia.org//wiki/Achilles_tendon_rupture')

" Achilles tendon rupture is when the Achilles tendon, at the back of the ankle, breaks.[5] Symptoms include the sudden onset of sharp pain in the heel.[3] A snapping sound may be heard as the tendon breaks and walking becomes difficult.[4]\n Rupture typically occurs as a result of a sudden bending up of the foot when the calf muscle is engaged, direct trauma, or long-standing tendonitis.[4][5] Other risk factors include the use of fluoroquinolones, a significant change in exercise, rheumatoid arthritis, gout, or corticosteroid use.[1][5] Diagnosis is typically based on symptoms and examination and supported by medical imaging.[5]\n Prevention may include stretching before activity.[4] Treatment may be by surgery or casting with the toes somewhat pointed down.[6][5][2] Relatively rapid return to weight bearing (within 4 weeks) appears okay.[6][7] The risk of re-rupture is about 25% with casting.[5] If appropriate treatment does not occur within 4 weeks of the injury outcomes are not as

## Make the DF

In [16]:
import pandas as pd 

df = pd.DataFrame(finalList, columns = ['Title', 'Url', 'Type', 'Text' ]) 

In [20]:
df.head(3)

|   | Title | Url | Type | Text |
|---|---|---|---|---|
| 0 | Achilles tendon rupture | https://en.wikipedia.org//wiki/Achilles_tendon... | DSS | Achilles tendon rupture is when the Achilles ... |
| 1 | ALPSA lesion | https://en.wikipedia.org//wiki/ALPSA_lesion | DSS | An ALPSA (anterior labral periosteal sleeve a... |
| 2 | Anterior cruciate ligament injury | https://en.wikipedia.org//wiki/Anterior_crucia... | DSS | Anterior cruciate ligament injury is when the... |


In [27]:
csv_data = df.to_csv(index=False)

In [29]:
df.to_csv('injury.csv', index=False)

In [30]:
df2 = pd.read_csv("injury.csv") 

In [32]:
df2.head(5)

|   | Title | Url | Type | Text |
|---|---|---|---|---|
| 0 | Achilles tendon rupture | https://en.wikipedia.org//wiki/Achilles_tendon... | DSS | Achilles tendon rupture is when the Achilles ... |
| 1 | ALPSA lesion | https://en.wikipedia.org//wiki/ALPSA_lesion | DSS | An ALPSA (anterior labral periosteal sleeve a... |
| 2 | Anterior cruciate ligament injury | https://en.wikipedia.org//wiki/Anterior_crucia... | DSS | Anterior cruciate ligament injury is when the... |
| 3 | Bankart lesion | https://en.wikipedia.org//wiki/Bankart_lesion | DSS | A Bankart lesion is an injury of the anterior... |
| 4 | Biceps femoris tendon rupture | https://en.wikipedia.org//wiki/Biceps_femoris_... | DSS | Biceps femoris tendon rupture can occur when ... |
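CSV works here, but long article text with embedded newlines and quotes can be awkward to inspect and query in a flat file. As one possible alternative, the sketch below stores the same four fields in SQLite using only the standard library. The table and column names (`injury`, `label`, `body`) are assumptions chosen for illustration, and the second row uses placeholder text rather than scraped content.

```python
import sqlite3

# Rows in the same order as the dataframe columns: Title, Url, Type, Text.
rows = [
    ('Achilles tendon rupture',
     'https://en.wikipedia.org//wiki/Achilles_tendon_rupture',
     'DSS',
     'Achilles tendon rupture is when the Achilles tendon, at the back of the ankle, breaks.'),
    ('Sports injury',
     'https://en.wikipedia.org//wiki/Sports_injury',
     'SI',
     'placeholder text'),  # not real scraped text
]

conn = sqlite3.connect(':memory:')  # use a file path instead to persist the data
conn.execute('CREATE TABLE injury (title TEXT, url TEXT, label TEXT, body TEXT)')
conn.executemany('INSERT INTO injury VALUES (?, ?, ?, ?)', rows)
conn.commit()

# Query by label, which a CSV file cannot do without loading everything.
dss_titles = [title for (title,) in
              conn.execute('SELECT title FROM injury WHERE label = ? ORDER BY title', ('DSS',))]
print(dss_titles)
conn.close()
```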


# END OF PART 1

What I could have done better:
1) Use the Wikipedia API instead of scraping the HTML
2) Use list comprehensions instead of explicit loops
3) Explore storage options other than CSV
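For the first point, the MediaWiki API can list category members directly through `action=query` with `list=categorymembers`, which avoids parsing the category page HTML. The sketch below is an assumption-laden outline rather than a tested replacement. The helper names and the `User-Agent` string are made up for illustration, continuation for categories with more than `cmlimit` members is not handled, and `fetch_category_titles` is defined but not called because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = 'https://en.wikipedia.org/w/api.php'

def category_query_url(category, limit=500):
    """Build the API URL that lists the members of one category."""
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category,
        'cmlimit': limit,
        'format': 'json',
    }
    return API_URL + '?' + urlencode(params)

def titles_from_response(payload):
    """Keep article titles only (namespace 0), skipping subcategories and files."""
    members = payload['query']['categorymembers']
    return [member['title'] for member in members if member.get('ns') == 0]

def fetch_category_titles(category):
    """Call the API and return article titles. Requires network access."""
    request = Request(category_query_url(category),
                      headers={'User-Agent': 'injury-scraper-example/0.1'})  # hypothetical agent string
    with urlopen(request) as response:
        return titles_from_response(json.load(response))

# Offline demo: a hand-written payload in the shape the API returns (page ids are made up).
sample_payload = {'query': {'categorymembers': [
    {'pageid': 1, 'ns': 0, 'title': 'Achilles tendon rupture'},
    {'pageid': 2, 'ns': 14, 'title': 'Category:Example subcategory'},
]}}
print(category_query_url('Category:Sports_injuries'))
print(titles_from_response(sample_payload))
```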

## Extra: what we broke down above

In [99]:
lst = ['one', 'two', 'three']
','.join(lst)

'one,two,three'

In [85]:
results = []
for i, j, _ in DSS_listWiki: # each entry is [url, title, label]
    page = requests.get(i)
    results.append(page)
print(results)

[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]


In [88]:
souplist = [BeautifulSoup(page.content, 'html.parser') for page in results]
print(souplist[-1].prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Whiplash (medicine) - Wikipedia
  </title>
  <script>
   document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Whiplash_(medicine)","wgTitle":"Whiplash (medicine)","wgCurRevisionId":909432479,"wgRevisionId":909432479,"wgArticleId":234482,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with unsourced statements","Articles with unsourced statements from May 2010","Wikipedia articles needing clarification from June 2015","Articles with unsourced statements from February 2013","Contortion","Injuries of neck","Law of negligence","Dislocations, sprains and strains"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext",

In [91]:
finder_section = [soup.find('div', class_='mw-content-ltr') for soup in souplist]

finder_all = finder_section.find_all('p')
print(finder_all)

AttributeError: 'list' object has no attribute 'find_all'
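The error comes from calling `find_all` on the list itself rather than on each element in it. A minimal sketch of one possible fix applies `find_all` per page inside a second comprehension and guards against pages where the content `div` is missing. The two inline HTML strings stand in for the downloaded pages and are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-ins for the downloaded pages (hand-written, not real Wikipedia HTML).
pages = [
    '<div class="mw-content-ltr"><p>First page, paragraph one.</p><p>Paragraph two.</p></div>',
    '<div class="other"><p>No content div here.</p></div>',
]
souplist = [BeautifulSoup(html, 'html.parser') for html in pages]

# One section (or None) per page.
sections = [soup.find('div', class_='mw-content-ltr') for soup in souplist]

# Call find_all on each section, not on the list that holds them.
paragraphs_per_page = [
    section.find_all('p') if section is not None else []
    for section in sections
]

# Join the paragraph text of each page into a single string.
texts = [' '.join(p.get_text() for p in paragraphs)
         for paragraphs in paragraphs_per_page]
print(texts)
```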

## List comprehension way to do what we did above

In [65]:
page = [requests.get(i).text for i, j, _ in DSS_listWiki]
page

['<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Achilles tendon rupture - Wikipedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Achilles_tendon_rupture","wgTitle":"Achilles tendon rupture","wgCurRevisionId":908759881,"wgRevisionId":908759881,"wgArticleId":2186340,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 German-language sources (de)","Articles with short description","All articles with unsourced statements","Articles with unsourced statements from August 2015","Articles needing additional references from January 2012","All articles needing additional references","RTT","Dislocations, sprains and strains","Sports injuries"],"wgBreakFrames":!1,"wgPageContentLanguage"

## This just prints the last page (do not use)

In [92]:
for i, j, _ in DSS_listWiki: # page is overwritten on every pass, so only the last page survives
    page = requests.get(i)
print(page.content)



In [93]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Whiplash (medicine) - Wikipedia
  </title>
  <script>
   document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Whiplash_(medicine)","wgTitle":"Whiplash (medicine)","wgCurRevisionId":909432479,"wgRevisionId":909432479,"wgArticleId":234482,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with unsourced statements","Articles with unsourced statements from May 2010","Wikipedia articles needing clarification from June 2015","Articles with unsourced statements from February 2013","Contortion","Injuries of neck","Law of negligence","Dislocations, sprains and strains"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext",

In [95]:
finder_section = soup.find('div', class_='mw-content-ltr')

finder_all = finder_section.find_all('p')
print(finder_all)

[<p><b>Whiplash</b>  is a non-medical term describing a range of <a href="/wiki/Injury" title="Injury">injuries</a> to the <a href="/wiki/Neck" title="Neck">neck</a> caused by or related to a sudden distortion of the neck<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> associated with <a href="/wiki/Anatomical_terms_of_motion#Flexion_and_extension" title="Anatomical terms of motion">extension</a>,<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> although the exact injury mechanisms remain unknown. The term "whiplash" is a <a href="/wiki/Colloquialism" title="Colloquialism">colloquialism</a>. "Cervical acceleration–deceleration" (CAD) describes the mechanism of the injury, while the term "whiplash associated disorders" (WAD) describes the injury sequelae and symptoms.
</p>, <p>Whiplash is commonly associated with <a class="mw-redirect" href="/wiki/Car_accident" title="Car accident">motor vehicle accidents</a>, usually when the vehicle

In [96]:
for item in finder_all:
    print(item.text)

Whiplash  is a non-medical term describing a range of injuries to the neck caused by or related to a sudden distortion of the neck[1] associated with extension,[2] although the exact injury mechanisms remain unknown. The term "whiplash" is a colloquialism. "Cervical acceleration–deceleration" (CAD) describes the mechanism of the injury, while the term "whiplash associated disorders" (WAD) describes the injury sequelae and symptoms.

Whiplash is commonly associated with motor vehicle accidents, usually when the vehicle has been hit in the rear;[3] however, the injury can be sustained in many other ways, including headbanging,[4] bungee jumping and falls.[5] It is one of the most frequently claimed injuries on vehicle insurance policies in certain countries; for example, in the United Kingdom 430,000 people made an insurance claim for whiplash in 2007, accounting for 14% of every driver's premium.[6]

Before the invention of the car, whiplash injuries were called "railway spine" as they 

## End of useless code