# Business Analysis based on 192.com listings

This notebook:

+ takes saved 192.com webpages, 
+ extracts relevant data, 
+ removes partial / missing data, and
+ saves a csv file of the results, ready for further analysis.

## Requirements

One or more HTML pages saved in the same directory as this notebook (recursive searching is not implemented).

Successfully tested on Ubuntu 19.04 and Windows 10.
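
Recursive searching is not part of this notebook. If it were needed, one possible approach is sketched below using the standard library's `pathlib`. The helper name `find_html_pages` is hypothetical and is not used anywhere else in the notebook.

```python
from pathlib import Path

def find_html_pages(root="."):
    """Return every .html file under root, including subdirectories.

    Hypothetical helper: the notebook itself only looks in the current directory.
    """
    # rglob walks the directory tree; is_file() skips directories named *.html
    return sorted(p for p in Path(root).rglob("*.html") if p.is_file())
```

The returned `Path` objects can be passed directly to `open()`, so the result could stand in for the `pages` list if saved pages are spread across subfolders.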

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from os import listdir

In [2]:
out_csv='listings.csv'

#create a list of all filenames in this directory that have an extension of '.html'
pages = [f for f in listdir('.') if f.endswith('.html')]

In [3]:
listOfLists=[]
for page in pages:
    with open(page) as f: # close the file once it has been read
        soup = BeautifulSoup(f.read(), "html.parser")
    listings = soup.find_all('li') # *almost* every list item is a business
    for listing in listings:
        business = listing.find('h3')
        sector   = listing.find('div', {'class':'test-ont-business-result-market-sector'})
        address  = listing.find('div', {'class':'test-ont-business-result-address'})
        if business is None:
            continue # not a business listing
        # always build four fields so the values line up with the DataFrame columns;
        # missing fields become None and are removed later by dropna()
        innerList=[business.text,
                   sector.text if sector is not None else None,
                   address.text if address is not None else None,
                   address.text.split(',')[-1] if address is not None else None] # store the postcode separately
        listOfLists.append(innerList)
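
To make the extraction step easier to follow, here is a minimal, self-contained sketch that runs the same lookups on a tiny inline snippet. The markup in `SAMPLE` is hypothetical: it is modelled only on the class names used in this notebook, not copied from a real 192.com page. The helpers `text_or_none` and `parse_listings` are also illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled only on the class names used in this notebook.
SAMPLE = """
<ul>
  <li>
    <h3>Example Bakery</h3>
    <div class="test-ont-business-result-market-sector">Bakers</div>
    <div class="test-ont-business-result-address">1 High Street, Burnley, BB11 1AA</div>
  </li>
  <li><a href="#">Next page</a></li>
</ul>
"""

def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

def parse_listings(html):
    """Return one [business, sector, address, postcode] row per business <li>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.find_all("li"):
        name = text_or_none(li.find("h3"))
        if name is None:
            continue  # navigation links and other non-business list items
        sector = text_or_none(li.find("div", {"class": "test-ont-business-result-market-sector"}))
        address = text_or_none(li.find("div", {"class": "test-ont-business-result-address"}))
        # assume the postcode is the last comma-separated part of the address
        postcode = address.split(",")[-1].strip() if address else None
        rows.append([name, sector, address, postcode])
    return rows

rows = parse_listings(SAMPLE)
```

Every row has exactly four entries, so a missing sector or address shows up as `None` in its own column rather than shifting the other values along.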


In [4]:
df=pd.DataFrame(listOfLists, columns=["Business", "Sector", "full_addr", "Postcode"])
df.to_csv('Temp.csv')
print(df.shape)
#no duplicates or blanks
df.replace('', np.nan, inplace=True)
# the data is 90% empty strings and we can't use partial rows, so we need to delete them
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)

#BB11 postcodes only
burnley_only=df[df['Postcode'].str.contains('BB11')] # blunt, may need tweaking.

# remove Manufacturers and Suppliers and Headquarters
# (the masks are built from burnley_only so that they share its index)
no_s=~burnley_only['Sector'].str.contains('Supplier')      #No suppliers
no_m=~burnley_only['Sector'].str.contains('Manufacture')   #No manufacturers
no_h=~burnley_only['Business'].str.contains('Headquarters')#No headquarters

no_smh=no_s&no_m&no_h
cust_only=burnley_only[no_smh]

cust_only.shape



(2445, 4)

(265, 4)
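
The `contains('BB11')` filter above is described as blunt: it matches `BB11` anywhere in the string, including inside a street name. One possible tightening, sketched here with the standard library only, is to require the whole value to look like a `BB11` postcode. The function name `in_district` is hypothetical, and the pattern assumes the usual UK layout of a district followed by a digit and two letters.

```python
import re

def in_district(postcode, district="BB11"):
    """Return True when postcode is in the given district, e.g. 'BB11 1AA'.

    Assumes the inward part of the postcode is one digit followed by two letters.
    """
    pattern = rf"^\s*{re.escape(district)}\s*\d[A-Za-z]{{2}}\s*$"
    return re.match(pattern, postcode, flags=re.IGNORECASE) is not None
```

For example, `in_district(" BB11 1AA")` is true, while `in_district("BB1 1AA")` and `in_district("1 BB11 Road")` are false. In pandas the same pattern could be applied to the `Postcode` column with `Series.str.match`, though that has not been tested against the listings used here.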

In [5]:
# our index is meaningless now that we have dropped multiple groups of rows, so leave it out of the csv
cust_only.to_csv(out_csv, index=False)
