This notebook scrapes a web article to retrieve the locations of 26 US tech hubs. The source article is here: https://www.zdnet.com/education/computers-tech/top-tech-hubs-in-the-us/.


#Importing Packages

In [124]:
#import packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

#Scraping the Data

In [125]:
#Request from website
url = "https://www.zdnet.com/education/computers-tech/top-tech-hubs-in-the-us/"
response = requests.get(url)

In [126]:
#Parse html
soup = BeautifulSoup(response.text, 'html.parser')

In [127]:
#Find all cities identified
techCities = soup.find_all('h3')

In [128]:
#Define empty list to save cities into
cityState = []

In [129]:
#For each identified city, add to list
for city in techCities:
    cityName = city.get_text()
    if cityName:
        cityState.append(cityName)
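
To make the heading extraction concrete, here is a minimal offline sketch. It swaps in the standard library's `html.parser.HTMLParser` for BeautifulSoup so that it runs without a network call or extra installs. The `H3Collector` class and the `sample_html` snippet are hypothetical and are not the real page markup. Unlike `get_text()` in the cell above, this sketch also strips surrounding whitespace.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collects the non-empty text content of every <h3> element."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self._buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so buffer it until </h3>.
        if self._in_h3:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            text = "".join(self._buffer).strip()
            if text:  # skip empty headings, as the loop above does
                self.headings.append(text)
            self._in_h3 = False


# Hypothetical markup standing in for the article page.
sample_html = """
<h2>Top tech hubs</h2>
<h3>Atlanta, GA</h3><p>...</p>
<h3>Dallas-Ft. Worth</h3>
<h3></h3>
<h3><a href="#">Austin, TX</a></h3>
"""

collector = H3Collector()
collector.feed(sample_html)
headings = collector.headings
print(headings)
```

Only text inside `h3` elements is kept, which is why any other `h3` on the real page (for example a sidebar heading) would also end up in the list and need to be cleaned out later.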

In [130]:
#Define dataframe
techData = pd.DataFrame(cityState)
techData = techData.rename(columns={0: "City, State"})

In [131]:
#Preview Uncleaned Data
techData.head(10)

Unnamed: 0,"City, State"
0,"Atlanta, GA"
1,"Austin, TX"
2,"Baltimore, MD"
3,"Boston, MA"
4,"Burlington, VT"
5,"Charlotte, NC"
6,"Chicago, IL"
7,"Cleveland, OH"
8,"Columbus, OH"
9,Dallas-Ft. Worth


#Cleaning the Data

In [132]:
#Split city and state on ","
techData[["City","State"]]= techData["City, State"].str.split(pat = ', ',expand = True)

In [133]:
#Remove original city,state column
techData = techData.drop(["City, State"], axis = 1)# Removes unnecessary columns

In [134]:
#Review Data set
techData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   City    25 non-null     object
 1   State   24 non-null     object
dtypes: object(2)
memory usage: 528.0+ bytes


In [135]:
#Notice that Dallas-Ft. Worth has no associated state (it should be Texas).
#Remove this combined row and append separate records for the two cities
techData = techData.drop([9], axis = 0)
updatedCities = pd.DataFrame([["Dallas", "TX"],["Fort Worth","TX"]], columns = ["City","State"])
techData = pd.concat([techData,updatedCities], ignore_index = True)
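
The cell above drops the combined row by its position (index 9). As an alternative sketch, the row could be located by its missing state instead, which would still work if the article's ordering changed. The `sample` frame below is a hypothetical three-row stand-in for the scraped data.

```python
import pandas as pd

# Hypothetical sample mimicking the scraped headings, including the combined metro.
sample = pd.DataFrame(
    {"City, State": ["Columbus, OH", "Dallas-Ft. Worth", "Cleveland, OH"]}
)
sample[["City", "State"]] = sample["City, State"].str.split(pat=", ", expand=True)
sample = sample.drop(columns="City, State")

# Headings without ", " produce no second piece, so State is missing.
missing_state = sample["State"].isna()
unsplit = sample.loc[missing_state, "City"].tolist()

# Drop by mask rather than by position, then append the corrected rows.
fixed_rows = pd.DataFrame(
    [["Dallas", "TX"], ["Fort Worth", "TX"]], columns=["City", "State"]
)
cleaned = pd.concat([sample[~missing_state], fixed_rows], ignore_index=True)
print(cleaned)
```

Printing `unsplit` before dropping anything is a cheap way to confirm that only the expected heading failed to split.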

In [136]:
#Keep only the two-letter state code (removes the hidden character at the end of the state)
techData["State"] = techData["State"].str[:2]
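
After trimming, it can be worth checking that every state value really is a two-letter code before merging on it. The sketch below assumes the hidden character is a non-breaking space. The notebook does not say what it actually is, so the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values: the trailing "\xa0" stands in for the unidentified
# hidden character mentioned in the notebook.
states = pd.Series(["GA\xa0", "TX\xa0", "MD", "TX"])

# Keep only the first two characters of each value.
cleaned = states.str[:2]

# Values that are not exactly two uppercase letters before cleaning.
invalid_before = states[~states.str.fullmatch(r"[A-Z]{2}")].tolist()

# After cleaning, every value should pass the same check.
is_valid = cleaned.str.fullmatch(r"[A-Z]{2}")
print(invalid_before, bool(is_valid.all()))
```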

In [137]:
#Load region data to merge into techData
regions = pd.read_csv('https://raw.githubusercontent.com/cphalpert/census-regions/master/us%20census%20bureau%20regions%20and%20divisions.csv')
regions.head()

Unnamed: 0,State,State Code,Region,Division
0,Alaska,AK,West,Pacific
1,Alabama,AL,South,East South Central
2,Arkansas,AR,South,West South Central
3,Arizona,AZ,West,Mountain
4,California,CA,West,Pacific


In [138]:
#Merge Region data
techData = techData.merge(regions, left_on='State', right_on='State Code')
techData = techData.drop(["State Code"], axis = 1)
techData = techData.rename(columns={"State_x": "State Code", "State_y": "State Name"})

techData.head(10)

Unnamed: 0,City,State Code,State Name,Region,Division
0,Atlanta,GA,Georgia,South,South Atlantic
1,Austin,TX,Texas,South,West South Central
2,Houston,TX,Texas,South,West South Central
3,Dallas,TX,Texas,South,West South Central
4,Fort Worth,TX,Texas,South,West South Central
5,Baltimore,MD,Maryland,South,South Atlantic
6,Boston,MA,Massachusetts,Northeast,New England
7,Burlington,VT,Vermont,Northeast,New England
8,Charlotte,NC,North Carolina,South,South Atlantic
9,Chicago,IL,Illinois,Midwest,East North Central
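
The merge above uses the default inner join, which silently drops any hub whose state code has no match in the region table. A minimal sketch of how unmatched rows could be surfaced with a left join and `indicator=True` follows. The two small frames are hypothetical, and "ZZ" is a deliberately invalid code.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
hubs = pd.DataFrame(
    {"City": ["Atlanta", "Austin", "Nowhere"], "State": ["GA", "TX", "ZZ"]}
)
regions = pd.DataFrame(
    {
        "State": ["Georgia", "Texas"],
        "State Code": ["GA", "TX"],
        "Region": ["South", "South"],
        "Division": ["South Atlantic", "West South Central"],
    }
)

# A left merge keeps every hub; indicator=True adds a "_merge" column
# that records whether each row found a match in the region table.
checked = hubs.merge(
    regions, left_on="State", right_on="State Code", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "City"].tolist()

# The default inner merge would simply lose the unmatched hub.
inner = hubs.merge(regions, left_on="State", right_on="State Code")
print(unmatched, len(checked), len(inner))
```

In this notebook the row count stays at 26 after the merge, so no hub appears to have been dropped.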


In [139]:
techData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26 entries, 0 to 25
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   City        26 non-null     object
 1   State Code  26 non-null     object
 2   State Name  26 non-null     object
 3   Region      26 non-null     object
 4   Division    26 non-null     object
dtypes: object(5)
memory usage: 1.2+ KB


In [140]:
#Export techData as .csv
techData.to_csv('techData.csv',index = False)

##Exploratory Analysis

The tech hubs data set identifies 26 tech hubs across the United States. If breweries thrive in places where the technology industry is strong, this data will help us check for a correlation between brewery hotspots and tech hubs.

Let's review the distribution of these tech hubs across the country and then map them to see the distribution visually.

In [141]:
#Count by region
techDataByRegions = techData.value_counts("Region").sort_index()
techDataByRegions


Region
Midwest      6
Northeast    4
South        9
West         7
dtype: int64

In [142]:
labels = techDataByRegions.index.tolist()  # region names, in the same order as the counts

# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=techDataByRegions[0:4], name="Tech Hubs"),
              1, 1)
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name+value")

fig.update_layout(
    title_text="Proportion of Tech Hubs by Region")
fig.show()
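
For readers without the interactive chart, the proportions it displays can be worked out directly from the region counts printed earlier. This is a small sketch in plain Python, and the dictionary simply copies those four counts.

```python
# Counts copied from the region tally printed earlier in the notebook.
region_counts = {"Midwest": 6, "Northeast": 4, "South": 9, "West": 7}

total = sum(region_counts.values())

# Share of all tech hubs in each region, as a percentage rounded to one decimal.
shares = {region: round(100 * n / total, 1) for region, n in region_counts.items()}

# Print the regions from largest to smallest share.
for region, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region:<10}{pct:>5}%")
```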


In [143]:
#Count by state
techDataByState = pd.DataFrame(techData.value_counts("State Code"))
techDataByState = techDataByState.rename( columns = {0:"Total Tech Hubs"})
techDataByState

State Code,Total Tech Hubs
TX,4
OH,2
CA,2
AZ,1
NC,1
VT,1
UT,1
TN,1
PA,1
OR,1


It looks like the South has the most tech hubs, followed by the West and Midwest, and finally the Northeast. Reviewing this visually will help us know where to look for brewery hotspots, so we can build the following map.

In [144]:
#Choropleth
import plotly.express as px

fig = px.choropleth(techDataByState, locations=techDataByState.index, locationmode="USA-states",
                    title = "Frequency of Tech Hubs by US State",
                    color = techDataByState["Total Tech Hubs"],
                    color_continuous_scale=px.colors.sequential.Pinkyl, scope="usa",
                    width=800, height=500)
fig.show()

This map shows the spread of tech hubs across the country. As expected, tech hubs are concentrated in the less rural parts of the country. Texas has the most hubs: Austin, Houston, Dallas, and Fort Worth. We may see some overlap between these tech hubs and other metropolitan cities.