# Crawling Wikipedia
## Using BeautifulSoup
### Connecting to a website and fetching data from it using HTML tags

Required links
https://en.wikipedia.org/wiki/Main_Page
view-source:https://en.wikipedia.org/wiki/Main_Page
https://www.youtube.com/watch?v=hGXf0FsW0A4
https://www.youtube.com/watch?v=wYhjXLbNmho

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "https://en.wikipedia.org/wiki/Main_Page"
conn = urlopen(url)
pageData = conn.read()
conn.close()

print(type(pageData))

try:
    bs = BeautifulSoup(pageData, "lxml")
except Exception:  # fall back to html5lib if the lxml parser is unavailable
    bs = BeautifulSoup(pageData, "html5lib")
print(type(bs))
print(bs.title)
print(bs.h1)
print(bs.find("h1", {"id": "firstHeading"}).string)
print(bs.find("div", {"id": "siteSub"}).text)

<class 'bytes'>
<class 'bs4.BeautifulSoup'>
<title>Wikipedia, the free encyclopedia</title>
<h1 class="firstHeading" id="firstHeading" lang="en">Main Page</h1>
Main Page
From Wikipedia, the free encyclopedia
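
Some sites, Wikipedia included, ask automated clients to identify themselves with a descriptive `User-Agent` header, which plain `urlopen(url)` does not set. The following is a minimal sketch of the same download step with an explicit header. The names `USER_AGENT`, `build_request` and `fetch` are hypothetical helpers, not part of the notebook, and the header value is a placeholder.

```python
from urllib.request import Request, urlopen

# Hypothetical identifier; replace it with a string describing your own script.
USER_AGENT = "notebook-crawler-example/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download a page and return its raw bytes."""
    # The with-block closes the connection even if read() raises.
    with urlopen(build_request(url), timeout=timeout) as conn:
        return conn.read()
```

With these helpers, `pageData = fetch(url)` could stand in for the `urlopen` / `read` / `close` sequence in the cell above; the bytes are then passed to `BeautifulSoup` as before.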


### Tags

In [68]:
# tag and its name
tag = bs.h1
print(tag)
print(type(tag))
print(tag.name)
print("==============================")

# attributes of tag
print(tag.attrs)
print(tag['class'])
print(tag.get('id'))

<h1 class="firstHeading" id="firstHeading" lang="en">Main Page</h1>
<class 'bs4.element.Tag'>
h1
{'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}
['firstHeading']
firstHeading


We can add, remove, and modify a tag’s name and attributes:

In [69]:
# modifying the tag name
tag.name = "div"

print(tag)
print(tag.name)
print("==============================")

print(tag.attrs)
print("==============================")

# modifying an attribute
tag['id'] = 'yatin'
print(tag.attrs)
print("==============================")

# adding a new attribute
tag['xyz'] = 'abc'
print(tag.attrs)
print("==============================")

# deleting an attribute
del tag['xyz']
print(tag.attrs)
print("==============================")


<div class="firstHeading" id="firstHeading" lang="en">Main Page</div>
div
{'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}
{'id': 'yatin', 'class': ['firstHeading'], 'lang': 'en'}
{'id': 'yatin', 'class': ['firstHeading'], 'lang': 'en', 'xyz': 'abc'}
{'id': 'yatin', 'class': ['firstHeading'], 'lang': 'en'}


### Fetching multiple links and writing them to a CSV file

In [70]:
headers = "Title,Link\n"  # two columns, matching the rows written below
filename = "scrapped.csv"

bodyContent = bs.find("div", {"id": "bodyContent"})
links = bodyContent.find_all("a", href=re.compile("(/wiki/)+([A-Za-z0-9_:()])"))

# utf-8 encoding avoids UnicodeEncodeError on titles such as "Eva Perón"
with open(filename, "w", encoding="utf-8") as f:
    f.write(headers)
    for link in links:
        title = link.get('title')
        if title is None:  # some links have no title attribute
            continue
        print(title + "," + link.get('href') + "\n")
        f.write(title + "," + link.get('href') + "\n")

Wikipedia,/wiki/Wikipedia

Free content,/wiki/Free_content

Encyclopedia,/wiki/Encyclopedia

Wikipedia:Introduction,/wiki/Wikipedia:Introduction

Special:Statistics,/wiki/Special:Statistics

English language,/wiki/English_language

Portal:Arts,/wiki/Portal:Arts

Portal:Biography,/wiki/Portal:Biography

Portal:Geography,/wiki/Portal:Geography

Portal:History,/wiki/Portal:History

Portal:Mathematics,/wiki/Portal:Mathematics

Portal:Science,/wiki/Portal:Science

Portal:Society,/wiki/Portal:Society

Portal:Technology,/wiki/Portal:Technology

Portal:Contents/Portals,/wiki/Portal:Contents/Portals

Eva Perón (1919–1952),/wiki/File:Eva_Per%C3%B3n_Retrato_Oficial.jpg

Evita (1996 film),/wiki/Evita_(1996_film)

Musical film,/wiki/Musical_film

Drama (film and television),/wiki/Drama_(film_and_television)

Evita (album),/wiki/Evita_(album)

Tim Rice,/wiki/Tim_Rice

Andrew Lloyd Webber,/wiki/Andrew_Lloyd_Webber

Evita (musical),/wiki/Evita_(musical)

Eva Perón,/wiki/Eva_Per%C3%B3n

Alan Parker,/wi
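
Joining fields with `","` by hand produces a broken row whenever a title itself contains a comma. A minimal sketch of the same step using the standard `csv` module, which quotes such fields, is shown below. The helper `write_links`, the anchored pattern `WIKI_HREF` and the sample rows are illustrative assumptions; the sample data is invented and not taken from the page.

```python
import csv
import io
import re

# Assumption: only hrefs that start with /wiki/ are wanted, so the pattern is anchored.
WIKI_HREF = re.compile(r"^/wiki/[A-Za-z0-9_:()]")


def write_links(rows, out):
    """Write (title, href) pairs as CSV, skipping missing titles and non-wiki links."""
    writer = csv.writer(out)
    writer.writerow(["Title", "Link"])
    written = 0
    for title, href in rows:
        if title is None or href is None or not WIKI_HREF.match(href):
            continue
        writer.writerow([title, href])
        written += 1
    return written


# Invented sample rows for illustration only.
sample = [
    ("Evita (1996 film)", "/wiki/Evita_(1996_film)"),
    ("Hello, world", "/wiki/Hello_world"),      # title containing a comma
    (None, "/wiki/No_title"),                   # link without a title attribute
    ("External", "https://example.org/wiki"),   # not a /wiki/ path
]
buffer = io.StringIO()
count = write_links(sample, buffer)
print(buffer.getvalue())
```

With the notebook's objects, the rows could be supplied as `((a.get("title"), a.get("href")) for a in links)` and `out` as a file opened with `newline=""` and `encoding="utf-8"`.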

### Converting the HTML to a readable and structured format

In [71]:
print(bs)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":847600508,"wgRevisionId":847600508,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShor

In [72]:
print(bs.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Wikipedia, the free encyclopedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":847600508,"wgRevisionId":847600508,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","D

### Downloading YouTube videos

In [3]:
#! pip install pytube

from pytube import YouTube

YouTube('https://www.youtube.com/watch?v=hGXf0FsW0A4').streams.first().download('./videos')
YouTube('https://www.youtube.com/watch?v=wYhjXLbNmho').streams.first().download('./videos')


In [93]:
# with mp4 format
from pytube import YouTube

YouTube('https://www.youtube.com/watch?v=wYhjXLbNmho').streams.filter(subtype='mp4').first().download('./videos')

In [14]:

from pytube import Playlist

pl = Playlist("https://www.youtube.com/playlist?list=PLGVShb98UkHiaI0nkyv4EfoP5InL2qVdv")
pl.download_all('./songs')

In [None]:
from pytube import Playlist

pl = Playlist("https://www.youtube.com/playlist?list=PLGVShb98UkHif1zV8z9H7aR-H9GZJvuV6")
#https://www.youtube.com/playlist?list=PLGVShb98UkHjGmUEg0L6-AaEfcl_MtnC9
pl.download_all('./songs')

In [15]:
from pytube import Playlist

pl = Playlist("https://www.youtube.com/playlist?list=PLGVShb98UkHjGmUEg0L6-AaEfcl_MtnC9")
pl.download_all('./songs')
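
The three playlist cells above differ only in the URL, so they can be folded into one loop. The sketch below assumes a hypothetical helper `download_playlists`; its default downloader reuses the `Playlist(url).download_all(...)` call from the cells above, which may not be available in every pytube version. The `downloader` parameter is an assumption added so the loop can be exercised without network access.

```python
def download_playlists(urls, out_dir="./songs", downloader=None):
    """Download each playlist once, in order, and return the URLs that failed."""
    if downloader is None:
        # Imported lazily so the helper can be tried without pytube installed.
        from pytube import Playlist

        def downloader(url, target):
            # Same call as in the cells above; it may differ across pytube versions.
            Playlist(url).download_all(target)

    failed = []
    for url in dict.fromkeys(urls):  # drops duplicate URLs, keeps order
        try:
            downloader(url, out_dir)
        except Exception as exc:
            # Keep going so one broken playlist does not stop the rest.
            print("failed:", url, exc)
            failed.append(url)
    return failed
```

Calling `download_playlists([...])` with the three playlist URLs from the cells above would then replace the repeated cells, and the returned list shows which playlists need a retry.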

### Converting video to audio using moviepy

In [16]:
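# ! pip install moviepy
import moviepy.editor as mp
# Caution: VideoFileClip expects the path of a single video file.
# Passing the directory "./songs/" raises the OSError shown below.
clip = mp.VideoFileClip("./songs/")
# write_audiofile likewise needs an output file name (for example one ending in .wav), not a directory.
clip.audio.write_audiofile("./videos/")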

OSError: MoviePy error: failed to read the duration of file ./songs/.
Here are the file infos returned by ffmpeg:

ffmpeg version 3.2.4 Copyright (c) 2000-2017 the FFmpeg developers
  built with gcc 6.3.0 (GCC)
  configuration: --enable-gpl --enable-version3 --enable-d3d11va --enable-dxva2 --enable-libmfx --enable-nvenc --enable-avisynth --enable-bzlib --enable-fontconfig --enable-frei0r --enable-gnutls --enable-iconv --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libfreetype --enable-libgme --enable-libgsm --enable-libilbc --enable-libmodplug --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenh264 --enable-libopenjpeg --enable-libopus --enable-librtmp --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvo-amrwbenc --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs --enable-libxvid --enable-libzimg --enable-lzma --enable-zlib
  libavutil      55. 34.101 / 55. 34.101
  libavcodec     57. 64.101 / 57. 64.101
  libavformat    57. 56.101 / 57. 56.101
  libavdevice    57.  1.100 / 57.  1.100
  libavfilter     6. 65.100 /  6. 65.100
  libswscale      4.  2.100 /  4.  2.100
  libswresample   2.  3.100 /  2.  3.100
  libpostproc    54.  1.100 / 54.  1.100
./songs/: Permission denied
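
The traceback above appears to come from handing `VideoFileClip` a directory rather than a file. A minimal sketch of converting every video in a folder, one file at a time, follows. The helpers `audio_target` and `convert_folder`, the `*.mp4` pattern and the `.wav` output extension are assumptions for illustration; the moviepy calls are the ones already used in the cell above, plus `close()`.

```python
from pathlib import Path


def audio_target(video_path, out_dir):
    """Return the .wav path in out_dir that corresponds to one video file."""
    return Path(out_dir) / (Path(video_path).stem + ".wav")


def convert_folder(in_dir="./songs", out_dir="./videos", pattern="*.mp4"):
    """Extract the audio track of every matching video in in_dir."""
    import moviepy.editor as mp  # imported lazily; same module as in the cell above

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for video in sorted(Path(in_dir).glob(pattern)):
        target = audio_target(video, out_dir)
        clip = mp.VideoFileClip(str(video))
        try:
            if clip.audio is not None:  # skip videos without an audio track
                clip.audio.write_audiofile(str(target))
                written.append(target)
        finally:
            clip.close()  # release the ffmpeg reader for this file
    return written
```

Writing `.wav` files keeps the result usable as input for `sr.AudioFile` in the speech recognition section.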


### Speech Recognition

SpeechRecognition is a wrapper around multiple speech recognition engines and APIs. Some of them, such as Wit.ai, also do sentiment analysis alongside speech recognition, while others, such as Sphinx, do speech recognition only.
It can also be combined with TensorFlow.
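
Because each engine is exposed as a `recognize_*` method on the same `Recognizer` object, trying a second engine when the first one fails takes only a small loop. The sketch below assumes a hypothetical helper `transcribe_with`; the engine names `"sphinx"` and `"google"` map to `recognize_sphinx` and `recognize_google`, and the latter needs network access.

```python
def transcribe_with(recognizer, audio, engines=("sphinx", "google")):
    """Try each engine in order and return (engine, text) for the first success."""
    errors = {}
    for engine in engines:
        method = getattr(recognizer, "recognize_" + engine, None)
        if method is None:
            errors[engine] = "no such recognizer method"
            continue
        try:
            return engine, method(audio)
        except Exception as exc:  # e.g. sr.UnknownValueError or sr.RequestError
            errors[engine] = repr(exc)
    raise RuntimeError("all engines failed: {}".format(errors))
```

With a `Recognizer` `r` and recorded `audio` as in the cells below, the call would be `engine, text = transcribe_with(r, audio)`.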

In [7]:
# ! pip install SpeechRecognition
# ! pip install PocketSphinx 

import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("./videos/english.wav") as source:
    audio = r.record(source)  # read the entire audio file
    
try:
    print(r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))


one two three


In [2]:
# ! pip install pyaudio
import speech_recognition as sr

import pyaudio
pa = pyaudio.PyAudio()
print(pa.get_default_input_device_info())

print(pyaudio.pa.__file__)
print(sr.Microphone())
# obtain audio from the microphone
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    r.adjust_for_ambient_noise(source)
    audio = r.listen(source)

# recognize speech using Sphinx
try:
    print(r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))

{'index': 1, 'structVersion': 2, 'name': 'Microphone (3- Practica SP-USB2', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
D:\001_Work\Anaconda\lib\site-packages\_portaudio.cp36-win_amd64.pyd
<speech_recognition.Microphone object at 0x00000000056D4278>
Say something!
i know and clinton get bones had a remotely sticking out i tell you that i will shine on at two no studying it wasn't that i am totally got tied up in who the good deal yet but wonder what was it when it the oscars


#### References

BeautifulSoup<br/>
https://www.crummy.com/software/BeautifulSoup/bs4/doc/ 

Pytube<br/>
https://python-pytube.readthedocs.io/en/latest/

Moviepy<br/>
https://zulko.github.io/moviepy/

SpeechRecognition<br/>
https://pypi.org/project/SpeechRecognition/

PyAudio<br/>
https://people.csail.mit.edu/hubert/pyaudio/docs/