# Webscraping and Text Manipulation: Statistics in Presidential Debates

Scrape presidential debate transcripts from the Commission on Presidential Debates website: http://www.debates.org/index.php?page=debate-transcripts.

1. Using `requests` and `BeautifulSoup`, I found all the links (URLs) on the website that point to transcripts of the **First Presidential Debates** from the years [2012, 2008, 2004, 2000, 1996, 1988, 1984, 1976, 1960].

2. Report:
    1. The length of each debate transcript (the number of characters in the transcript string).
    2. How many times the word **war** was used in each debate.
    3. The most commonly used word in each debate, and how many times it was used.

In [1]:
import requests
import bs4 as bs
import pandas as pd
import re

In [2]:
source = requests.get('http://www.debates.org/index.php?page=debate-transcripts').content

In [3]:
soup = bs.BeautifulSoup(source,'html.parser')

In [4]:
content = soup.find(id='content-sm')

In [5]:
firstdebatelink=[]

In [6]:
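for link in content.find_all('a'):
    # link.string is None when the <a> tag has no single text child,
    # so check it before testing for the substring.
    if link.string and 'The First' in link.string:
        firstdebatelink.append(link.get('href'))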

In [7]:
firstdebatelink

['http://www.debates.org/index.php?page=october-3-2012-debate-transcript',
 'http://www.debates.org/index.php?page=2008-debate-transcript',
 'http://www.debates.org/index.php?page=september-30-2004-debate-transcript',
 'http://www.debates.org/index.php?page=october-3-2000-transcript',
 'http://www.debates.org/index.php?page=october-6-1996-debate-transcript',
 'http://www.debates.org/index.php?page=september-25-1988-debate-transcript',
 'http://www.debates.org/index.php?page=october-7-1984-debate-transcript',
 'http://www.debates.org/index.php?page=september-23-1976-debate-transcript',
 'http://www.debates.org/index.php?page=september-26-1960-debate-transcript']

In [8]:
for link in content.find_all('a'):
    # Skip anchors whose .string is None (no single text child).
    if link.string and 'The First' in link.string:
        print(link.get('href'))

http://www.debates.org/index.php?page=october-3-2012-debate-transcript
http://www.debates.org/index.php?page=2008-debate-transcript
http://www.debates.org/index.php?page=september-30-2004-debate-transcript
http://www.debates.org/index.php?page=october-3-2000-transcript
http://www.debates.org/index.php?page=october-6-1996-debate-transcript
http://www.debates.org/index.php?page=september-25-1988-debate-transcript
http://www.debates.org/index.php?page=october-7-1984-debate-transcript
http://www.debates.org/index.php?page=september-23-1976-debate-transcript
http://www.debates.org/index.php?page=september-26-1960-debate-transcript


In [9]:
title=[]

In [10]:
for link in content.find_all('a'):
    # Skip anchors whose .string is None (no single text child).
    if link.string and 'The First' in link.string:
        title.append(link.string)

In [11]:
title

['October 3, 2012: The First Obama-Romney Presidential Debate',
 'September 26, 2008: The First McCain-Obama Presidential Debate',
 'September 30, 2004: The First Bush-Kerry Presidential Debate',
 'October 3, 2000: The First Gore-Bush Presidential Debate',
 'October 6, 1996: The First Clinton-Dole Presidential Debate',
 'September 25, 1988: The First Bush-Dukakis Presidential Debate',
 'October 7, 1984: The First Reagan-Mondale Presidential Debate',
 'September 23, 1976: The First Carter-Ford Presidential Debate',
 'September 26, 1960: The First Kennedy-Nixon Presidential Debate']
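
The full link titles make long column labels. As a minimal sketch, assuming every title contains a four-digit year as in the list above, a regular expression can pull out the year for use as a shorter label. The helper name `year_from_title` is hypothetical and not part of the original notebook.

```python
import re

def year_from_title(title):
    """Return the first four-digit year (19xx or 20xx) in a title, or None."""
    match = re.search(r'\b(?:19|20)\d{2}\b', title)
    return int(match.group(0)) if match else None

# Two of the titles scraped above, used here as sample input.
titles = [
    'October 3, 2012: The First Obama-Romney Presidential Debate',
    'September 26, 1960: The First Kennedy-Nixon Presidential Debate',
]
years = [year_from_title(t) for t in titles]
print(years)
```

The resulting list could be passed as `columns=` when creating the DataFrame if shorter headers are preferred.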

In [12]:
df = pd.DataFrame(columns=title)

In [13]:
df

| | October 3, 2012: The First Obama-Romney Presidential Debate | September 26, 2008: The First McCain-Obama Presidential Debate | September 30, 2004: The First Bush-Kerry Presidential Debate | October 3, 2000: The First Gore-Bush Presidential Debate | October 6, 1996: The First Clinton-Dole Presidential Debate | September 25, 1988: The First Bush-Dukakis Presidential Debate | October 7, 1984: The First Reagan-Mondale Presidential Debate | September 23, 1976: The First Carter-Ford Presidential Debate | September 26, 1960: The First Kennedy-Nixon Presidential Debate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |


In [14]:
nospacechars = []

In [15]:
for url in firstdebatelink:
    source = requests.get(url).content
    soup = bs.BeautifulSoup(source, 'html.parser')
    content = soup.find(id='content-sm')
    # This assumes the whole transcript sits in the first <p> of the content block.
    text = content.find('p').text
    # Count characters after removing all whitespace (spaces, tabs, newlines).
    nospacechars.append(len(re.sub(r"\s+", "", text)))

In [16]:
nospacechars

[77994, 150409, 68161, 74689, 76428, 71975, 71293, 66121, 50001]
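
Note that these figures exclude whitespace, whereas the task statement asks for the number of characters in the transcript string. The following sketch, run on a made-up sample string rather than a real transcript, shows how the two measures differ; which one to report is a judgment call.

```python
import re

# Made-up sample text; not taken from any transcript.
sample = "one two\nthree  four\t five"

total_chars = len(sample)                          # every character, whitespace included
non_space_chars = len(re.sub(r"\s+", "", sample))  # whitespace stripped, as in the loop above

print(total_chars, non_space_chars)
```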

In [17]:
war_count=[]

In [18]:
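for i in firstdebatelink:
    source = requests.get(i).content
    soup = bs.BeautifulSoup(source,'html.parser')
    content = soup.find(id='content-sm')
    text = content.find('p').text
    # Split on non-word characters so punctuation does not stick to words.
    tokens = re.split(r'\W', text)
    # Count singular and plural, lowercase and capitalized; all-caps "WAR" is not counted.
    war_count.append(sum(tokens.count(w) for w in ('war', 'War', 'wars', 'Wars')))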

In [19]:
war_count

[5, 48, 64, 11, 15, 14, 3, 7, 3]
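
The loop above counts four explicit spellings. An alternative, sketched here on a made-up sentence, is a case-insensitive regular expression with word boundaries. It also matches all-caps "WAR", so on the real transcripts it may give slightly different totals from those shown. The names `WAR_PATTERN` and `count_war` are hypothetical.

```python
import re

# \b marks word boundaries, so "warnings" and "Postwar" are not matched.
WAR_PATTERN = re.compile(r"\bwars?\b", flags=re.IGNORECASE)

def count_war(text):
    """Count occurrences of 'war' or 'wars' as whole words, ignoring case."""
    return len(WAR_PATTERN.findall(text))

sample = "War is costly. The wars ended, but warnings about war remain. Postwar aid followed."
print(count_war(sample))
```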

In [20]:
from collections import Counter

In [21]:
most_common_w=[]

In [22]:
for i in firstdebatelink:
    source = requests.get(i).content
    soup = bs.BeautifulSoup(source,'html.parser')
    content = soup.find(id='content-sm')
    count = content.find('p').text
    words = re.findall(r'\w+', count)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)
    mostcommon = word_counts.most_common(1)
    most_common_w.append(mostcommon[0][0])

In [23]:
most_common_w

['THE', 'THE', 'THE', 'THE', 'THE', 'THE', 'THE', 'THE', 'THE']
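
Because "THE" tops every debate, the result says little about the content. One possible refinement, shown as a sketch only, is to skip common function words before counting. The `STOP_WORDS` set below is a small hand-picked assumption, not a standard list, and `most_common_content_word` is a hypothetical helper.

```python
import re
from collections import Counter

# Hand-picked, illustrative stop words (an assumption, not an official list).
STOP_WORDS = {"THE", "A", "AN", "AND", "OF", "TO", "IN", "THAT", "IS", "I", "WE", "IT"}

def most_common_content_word(text):
    """Return (word, count) for the most frequent non-stop word, or None if there is none."""
    words = [w.upper() for w in re.findall(r"\w+", text)]
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(1)[0] if counts else None

# Made-up sample sentence.
sample = "The economy is the issue. The economy and the jobs."
print(most_common_content_word(sample))
```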

In [24]:
most_common_w_count=[]

In [25]:
for i in firstdebatelink:
    source = requests.get(i).content
    soup = bs.BeautifulSoup(source,'html.parser')
    content = soup.find(id='content-sm')
    count = content.find('p').text
    words = re.findall(r'\w+', count)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)
    mostcommon = word_counts.most_common(1)
    most_common_w_count.append(mostcommon[0][1])

In [26]:
most_common_w_count

[757, 1470, 857, 919, 876, 804, 867, 857, 779]
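
The four loops above download each transcript four times, and `most_common(1)` already returns the word and its count together. As a sketch, all four values can be computed from a transcript string in one function, so each page would only need to be fetched once. The function name `summarize`, the dictionary keys, and the placeholder strings are assumptions for illustration; the function expects non-empty text.

```python
import re
from collections import Counter

def summarize(text):
    """Compute the four reported values from one non-empty transcript string."""
    words = re.findall(r"\w+", text)
    # most_common(1) yields a single (word, count) pair.
    word, count = Counter(w.upper() for w in words).most_common(1)[0]
    return {
        "chars_no_whitespace": len(re.sub(r"\s+", "", text)),
        "war_count": sum(w in {"war", "War", "wars", "Wars"} for w in words),
        "most_common_word": word,
        "most_common_count": count,
    }

# Placeholder strings stand in for the downloaded transcripts.
texts = {"debate_a": "the war and the peace", "debate_b": "War, the wars, the end"}
stats = {name: summarize(text) for name, text in texts.items()}
print(stats)
```

In the notebook, `texts` would be filled by a single loop over `firstdebatelink`, storing each page's text before any counting is done.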

In [27]:
df.loc[1] = nospacechars
df.loc[2] = war_count
df.loc[3] = most_common_w
df.loc[4] = most_common_w_count

In [28]:
df['Name']=['Debate char length','war_count','most_common_w','most_common_w_count']

In [29]:
df.set_index('Name')

| Name | October 3, 2012: The First Obama-Romney Presidential Debate | September 26, 2008: The First McCain-Obama Presidential Debate | September 30, 2004: The First Bush-Kerry Presidential Debate | October 3, 2000: The First Gore-Bush Presidential Debate | October 6, 1996: The First Clinton-Dole Presidential Debate | September 25, 1988: The First Bush-Dukakis Presidential Debate | October 7, 1984: The First Reagan-Mondale Presidential Debate | September 23, 1976: The First Carter-Ford Presidential Debate | September 26, 1960: The First Kennedy-Nixon Presidential Debate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Debate char length | 77994 | 150409 | 68161 | 74689 | 76428 | 71975 | 71293 | 66121 | 50001 |
| war_count | 5 | 48 | 64 | 11 | 15 | 14 | 3 | 7 | 3 |
| most_common_w | THE | THE | THE | THE | THE | THE | THE | THE | THE |
| most_common_w_count | 757 | 1470 | 857 | 919 | 876 | 804 | 867 | 857 | 779 |
