## Web Scraping with Beautiful Soup and Regex

The purpose of this exercise is to use *BeautifulSoup* to extract information from this article: https://www.usatoday.com/story/money/business/2018/09/13/mcdonalds-states-most-stores/37748287/.

Unfortunately, the information we want is not stored in a table and is not formatted in a way that makes it easy to extract, so it will take some work before you can do any analysis.

Your objective is to create a pandas DataFrame containing all 50 states and the four metrics from the article (number of McDonald's per 100,000, adult obesity rate, percent consuming vegetables less than daily, and median household income).

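findAll(...) --> class 'bs4.element.ResultSet' (a list-like collection of tags)

findAll(...)[0] --> class 'bs4.element.Tag'

findAll(...)[0].get('alt', default='No alt') --> value of the tag's `alt` attribute, or the default if the attribute is missing

findAll(...)[0].text --> text content of the tag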
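
The notes above can be checked on a tiny document. This is a minimal sketch: the HTML string is invented for illustration and is not taken from the article, and it uses `find_all`, the current name for the older `findAll` alias.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this illustration only.
html = (
    '<div><img src="a.png" alt="Golden arches">'
    '<img src="b.png"><p>Hello <b>world</b></p></div>'
)
soup = BeautifulSoup(html, "html.parser")

images = soup.find_all("img")             # all matching tags
result_type = type(images).__name__       # name of the collection class
first_type = type(images[0]).__name__     # name of a single element's class

# .get() reads an attribute and falls back to the default when it is absent.
alt_first = images[0].get("alt", default="No alt")
alt_second = images[1].get("alt", default="No alt")

# .text concatenates the text of the tag and all of its children.
para_text = soup.find("p").text

print(result_type, first_type, alt_first, alt_second, para_text, sep=" | ")
```

Note that `.get()` returns an attribute value, not the tag's text; use `.text` for the visible content.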

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

### Step 1: Use _requests_ to fetch the contents of the article and convert it to soup with _BeautifulSoup_

In [2]:
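url = 'https://www.usatoday.com/story/money/business/2018/09/13/mcdonalds-states-most-stores/37748287/'
response = requests.get(url)
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning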

### Step 2: Extract State Names

**A.** Using whatever method you like, extract the states as a list named `states`. Keep them in the same order in which they appear in the article.

In [3]:
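# State names are in the article's <h3> subheadings
states = soup.findAll('h3', attrs={'class': 'gnt_ar_b_h2'})
states = states[:-1]  # drop the final match, which is not a state heading
states = [state.text for state in states]
states = [re.sub(r'\d*\.\s', '', state) for state in states]  # strip the leading "N. " numbering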
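
To see what the substitution does in isolation, here is a small sketch on made-up headings. The `"<number>. <state>"` format is an assumption inferred from the pattern used above, and the numbers are invented. Anchoring the pattern with `^` is a slightly stricter variant that only touches the start of the string.

```python
import re

# Hypothetical headings in an assumed "<number>. <state>" format;
# the numbers are invented for illustration.
raw_headings = ["50. Rhode Island", "49. New Jersey", "7. West Virginia"]

# Match digits, a literal dot, and any following whitespace at the start only.
rank_prefix = re.compile(r"^\d+\.\s*")

states_demo = [rank_prefix.sub("", heading).strip() for heading in raw_headings]
print(states_demo)
```

The raw-string prefix (`r"..."`) keeps Python from interpreting `\d` and `\s` as string escapes.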

**B.** Now, extract the other four variables as lists named `McD`, `obesity`, `veggies`, and `income`. Make sure that they are in the same order as states.

In [4]:
# Each state's metrics sit in a paragraph that mentions "No. of McDonald's"
mc = soup.findAll('p', attrs={'class': 'gnt_ar_b_p'})
mc = [j.text for j in mc]
mc = [x for x in mc if "No. of McDonald's" in x]
McD = [float(re.findall(r'(\d+\.\d) per 100,000', a)[0]) for a in mc]
obe = [float(re.findall(r'Adult obesity rate: (\d+\.\d) percent', a)[0]) for a in mc]
veg = [float(re.findall(r'than daily: (\d+\.\d) percent', a)[0]) for a in mc]
inc = [re.findall(r'income: \$(\d+,\d+)', a)[0] for a in mc]
inc = [int(i.replace(',', '')) for i in inc]  # e.g. "60,596" -> 60596
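
The four patterns can be exercised on a single string before running them over the whole page. This is only a sketch: the wording of `sample` is an assumption reconstructed from the patterns above (the values are the Rhode Island row from the results), and `extract_metrics` is a hypothetical helper, not part of the exercise.

```python
import re

# Assumed paragraph wording, reconstructed from the regex patterns;
# the real article text may differ.
sample = (
    "No. of McDonald's: 2.9 per 100,000 people. "
    "Adult obesity rate: 27.2 percent. "
    "Pct. of adults consuming vegetables less than daily: 23.5 percent. "
    "Median household income: $60,596"
)

PATTERNS = {
    "McD": r"(\d+\.\d+) per 100,000",
    "obesity": r"Adult obesity rate: (\d+\.\d+) percent",
    "veggies": r"than daily: (\d+\.\d+) percent",
    "income": r"income: \$(\d{1,3}(?:,\d{3})*)",
}

def extract_metrics(text):
    """Return a dict of the four metrics, or None if any pattern fails to match."""
    values = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            return None
        # Remove thousands separators before converting to a number.
        values[name] = float(match.group(1).replace(",", ""))
    values["income"] = int(values["income"])
    return values

print(extract_metrics(sample))
```

Returning `None` on a failed match makes a malformed paragraph easy to spot, whereas indexing `re.findall(...)[0]` raises an `IndexError` without saying which paragraph caused it.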

### Step 3: Convert the Result to a pandas DataFrame

Once you have created a DataFrame, take a look at the results and see if there are any significant correlations between the variables.

In [6]:
list_of_tuples = list(zip(states, McD, obe, veg, inc))
clean_data = pd.DataFrame(list_of_tuples, columns=['States', 'Num_of_McD', 'Obesity_Rate', 'Veg_Consumption_Rate', 'Median_Income'])
clean_data.head(5)

| | States | Num_of_McD | Obesity_Rate | Veg_Consumption_Rate | Median_Income |
|---|---|---|---|---|---|
| 0 | Rhode Island | 2.9 | 27.2 | 23.5 | 60596 |
| 1 | New Jersey | 3.0 | 25.9 | 22.1 | 76126 |
| 2 | New York | 3.1 | 25.0 | 22.4 | 62909 |
| 3 | California | 3.3 | 22.7 | 18.6 | 67739 |
| 4 | North Dakota | 3.3 | 31.8 | 27.5 | 60656 |
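
For the correlation check that Step 3 asks for, one possible approach is `DataFrame.corr()`, which computes pairwise Pearson correlations by default. The sketch below rebuilds only the five preview rows so that it runs on its own; the name `preview` is hypothetical, and with the full 50-state data you would call the same method on `clean_data`.

```python
import pandas as pd

# Only the five rows from the preview; far too few for real conclusions.
preview = pd.DataFrame({
    "States": ["Rhode Island", "New Jersey", "New York", "California", "North Dakota"],
    "Num_of_McD": [2.9, 3.0, 3.1, 3.3, 3.3],
    "Obesity_Rate": [27.2, 25.9, 25.0, 22.7, 31.8],
    "Veg_Consumption_Rate": [23.5, 22.1, 22.4, 18.6, 27.5],
    "Median_Income": [60596, 76126, 62909, 67739, 60656],
})

# Drop the text column, then compute the pairwise correlation matrix.
corr = preview.drop(columns="States").corr()
print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, but statistical significance would need a separate test on the full data set.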
