### Imports

In [14]:
import pandas as pd
import numpy as np
from lxml import etree

### Flow

* Unzip the data by shelling out to the command line with IPython magics
* Read the CSVs as ASCII and show the encoding errors
* Read them as UTF-8 and show that the errors are handled correctly
* Normalise names?
* Normalise postcodes, e.g. split them into parts?
* Do some aggregations and grouping?
* Masks: use several different criteria to show how useful reusable masks are
* Could we do something with PDF here? HTML pages might also be useful (download, unzip, process HTML)

### Preprocess and load reference data

In [15]:
#!unzip '/var/data/s2ds/companies/*.zip' -d /var/data/s2ds/companies/

In [16]:
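def read_files(files):
    """Load each CSV with pandas' default encoding and stack the results."""
    dfs = []

    for f in files:
        df = pd.read_csv(f)
        # The company number column is named with a leading space (renamed later).
        dfs.append(df[['CompanyName', ' CompanyNumber', 'RegAddress.PostCode']])

    return pd.concat(dfs)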

In [None]:
How do we do this then?

In [17]:
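def read_files_utf8(files):
    """Load each CSV as UTF-8 explicitly and stack the results."""
    dfs = []

    for f in files:
        df = pd.read_csv(f, encoding='utf8')
        # The company number column is named with a leading space (renamed later).
        dfs.append(df[['CompanyName', ' CompanyNumber', 'RegAddress.PostCode']])

    return pd.concat(dfs)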
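
One way to show the encoding errors mentioned in the flow, without touching the real files, is to decode a small in-memory sample both ways. This is a minimal sketch that assumes the files are UTF-8 encoded; the sample bytes and the names `sample`, `ascii_failed` and `utf8_df` are made up for illustration, and `dtype=str` is added only so the leading zero in the company number survives.

```python
import io

import pandas as pd

# Hypothetical two-line sample, encoded as UTF-8 (an assumption about the real files).
sample = "CompanyName, CompanyNumber\nNESTLÉ WATERS UK LIMITED,02334804\n".encode("utf-8")

# Decoding as ASCII fails on the non-ASCII bytes of 'É'.
try:
    pd.read_csv(io.BytesIO(sample), encoding="ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding as UTF-8 succeeds; dtype=str keeps '02334804' as text.
utf8_df = pd.read_csv(io.BytesIO(sample), encoding="utf-8", dtype=str)
```

The same `encoding` argument could be passed through a single reader function instead of keeping two near-identical copies.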

In [83]:
files = !ls /var/data/s2ds/companies/*_5.csv

In [84]:
df = read_files(files)



In [28]:
df[df[' CompanyNumber'] == '02334804']

| | CompanyName | CompanyNumber | RegAddress.PostCode |
|---|---|---|---|
| 603065 | NESTLÉ WATERS UK LIMITED | 2334804 | RH6 0PA |


In [20]:
df.shape

(850000, 3)

In [21]:
df.head()

| | CompanyName | CompanyNumber | RegAddress.PostCode |
|---|---|---|---|
| 0 | J.C. ROOFING SERVICES LTD | 09176364 | WD25 8BT |
| 1 | J.C. ROOK & SONS LIMITED | 02117042 | CT11 7DZ |
| 2 | J.C. ROXBURGH & COMPANY LIMITED | SC041244 | G81 1LQ |
| 3 | J.C. ROXBURGH (INVESTMENTS) LIMITED | SC037420 | G83 9LX |
| 4 | J.C. ROXBURGH PROPERTIES LIMITED | SC073801 | G81 1LQ |


In [85]:
df.rename(columns={' CompanyNumber': 'CompanyNumber'}, inplace=True)
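
Renaming one column works, but if other headers turn out to carry the same stray whitespace it may be simpler to strip every column name at once. A sketch using a tiny hand-made frame as stand-in data (the frame name `companies` is hypothetical):

```python
import pandas as pd

# Stand-in frame; the leading space mirrors the ' CompanyNumber' header above.
companies = pd.DataFrame({
    'CompanyName': ['J.C. ROOFING SERVICES LTD'],
    ' CompanyNumber': ['09176364'],
    'RegAddress.PostCode': ['WD25 8BT'],
})

# Index.str.strip() removes leading/trailing whitespace from every column name.
companies.columns = companies.columns.str.strip()
```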

In [86]:
df[df['CompanyNumber'] == '02334804']

| | CompanyName | CompanyNumber | RegAddress.PostCode |
|---|---|---|---|
| 603065 | NESTLÉ WATERS UK LIMITED | 2334804 | RH6 0PA |


In [104]:
df['clean_name'] = df['CompanyName'].str.lower()
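
Lower-casing is the simplest form of the name normalisation raised in the flow. A slightly fuller sketch is below; `normalise_name` is a hypothetical helper, and mapping "ltd" to "limited" is an assumption about what should count as the same name, not something the data confirms. It would need to be applied to both sides of any later join.

```python
import pandas as pd


def normalise_name(names: pd.Series) -> pd.Series:
    """Lower-case, trim, collapse whitespace and expand 'ltd' (assumed rule)."""
    return (
        names.str.lower()
        .str.strip()
        .str.replace(r'\s+', ' ', regex=True)            # collapse runs of whitespace
        .str.replace(r'\bltd\b\.?', 'limited', regex=True)  # 'ltd' / 'ltd.' -> 'limited'
    )


# Made-up inputs in the style of the company names above.
raw_names = pd.Series(['BAREFACED SKINCARE LIMITED', '  Floreana   Ltd. ', 'STEP2PROGRESS LIMITED'])
clean_names = normalise_name(raw_names)
```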

### Preprocess and load UK IPO data

In [56]:
tree = etree.parse('/var/data/s2ds/trademarks/jnl.xml')

In [57]:
doc = tree.getroot()

In [98]:
applicants = [a.text.lower() for a in doc.findall('.//ApplicantName') if a.text]
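
The extraction step can be tried on a small inline document. This sketch swaps in the standard library's `xml.etree.ElementTree` for `lxml` so it runs without extra dependencies; `findall` with the `.//ApplicantName` path behaves the same way for this case. Only the `ApplicantName` tag comes from the notebook; the `Journal` and `TradeMark` elements are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical journal fragment; only the ApplicantName tag is taken from the real file.
xml_text = """
<Journal>
  <TradeMark><ApplicantName>Floreana Ltd</ApplicantName></TradeMark>
  <TradeMark><ApplicantName>Peter John Savage</ApplicantName></TradeMark>
  <TradeMark><ApplicantName/></TradeMark>
</Journal>
"""

root = ET.fromstring(xml_text)

# Empty elements have text None, so skip them before calling string methods.
sample_applicants = [
    a.text.strip().lower()
    for a in root.findall('.//ApplicantName')
    if a.text and a.text.strip()
]
```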

In [105]:
trademarks = pd.DataFrame({'applicants': applicants})

In [106]:
trademarks = pd.merge(trademarks, df, how='left', left_on='applicants', right_on='clean_name')

In [107]:
trademarks.head()

| | applicants | CompanyName | CompanyNumber | RegAddress.PostCode | clean_name |
|---|---|---|---|---|---|
| 0 | barefaced skincare limited | BAREFACED SKINCARE LIMITED | 09525777 | OX26 4LD | barefaced skincare limited |
| 1 | floreana ltd | FLOREANA LTD | SC432789 | EH6 5SD | floreana ltd |
| 2 | floreana ltd | FLOREANA LTD | SC432789 | EH6 5SD | floreana ltd |
| 3 | step2progress limited | STEP2PROGRESS LIMITED | 09301218 | SY1 3AF | step2progress limited |
| 4 | peter john savage | | | | |


In [109]:
trademarks.clean_name.isnull().sum()

601
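
An alternative to counting nulls in `clean_name` is to ask `pd.merge` to label each row with its origin via `indicator=True`, which adds a `_merge` column. A sketch on two tiny stand-in frames (the frame names and their contents are illustrative, taken from the rows shown above):

```python
import pandas as pd

# Stand-in data: one applicant matches a company, one does not.
tm = pd.DataFrame({'applicants': ['floreana ltd', 'peter john savage']})
companies = pd.DataFrame({'clean_name': ['floreana ltd'], 'CompanyNumber': ['SC432789']})

merged = pd.merge(
    tm, companies,
    how='left', left_on='applicants', right_on='clean_name',
    indicator=True,  # adds '_merge' with values 'both' / 'left_only' / 'right_only'
)

# Rows found only on the left are the applicants with no matching company.
n_unmatched = int((merged['_merge'] == 'left_only').sum())
```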

In [111]:
trademarks[trademarks.clean_name.isnull()].head(10)

| | applicants | CompanyName | CompanyNumber | RegAddress.PostCode | clean_name |
|---|---|---|---|---|---|
| 4 | peter john savage | | | | |
| 5 | vincent dassault | | | | |
| 6 | ian redman | | | | |
| 7 | turbomed orthotics inc. | | | | |
| 9 | marek ryan larwood | | | | |
| 17 | péan & zakian llp | | | | |
| 18 | paul karaiskos | | | | |
| 19 | aran osman | | | | |
| 20 | barrington brown | | | | |
| 21 | joe bearmen | | | | |


In [112]:
trademarks = trademarks[trademarks.clean_name.notnull()]

In [115]:
trademarks['RegAddress.PostCode'].str.startswith("KT").sum()

13
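
The `startswith("KT")` count generalises to the postcode splitting and grouping ideas from the flow. A sketch that tidies the postcodes, pulls out the leading letters (the postcode area) and counts rows per area; the sample values come from the outputs above, except the lower-case `'kt1 2ab'` and the missing value, which are made up to exercise the cleaning.

```python
import pandas as pd

postcodes = pd.Series(
    ['WD25 8BT', 'CT11 7DZ', 'G81 1LQ', 'G83 9LX', 'kt1 2ab', None],
    name='RegAddress.PostCode',
)

# Upper-case and trim so that 'kt1 2ab' and 'KT1 2AB' are treated alike.
clean = postcodes.str.upper().str.strip()

# The outward code is the part before the space; the area is its leading letters.
outward = clean.str.split().str[0]
area = clean.str.extract(r'^([A-Z]{1,2})', expand=False)

# Count postcodes per area; rows with a missing area are dropped by groupby.
area_counts = clean.groupby(area).size()
```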

### Stuff to cover

Look at:
* Broadcast operations ```df.mean()```


* Working with missing data ```df.fillna()``` - DONE
Trademarks we didn't match

* Comparisons ```df.gt()```

* Descriptive/statistics ```df.describe(), df.nunique(), df.hist()```

* Row/column/element-wise function application ```df.apply(lambda x: x**2)```

* Vectorized string methods ```series.str.lower()``` - PARTIAL
Postcode manipulation

* SQL-like merging ```pd.merge()``` - DONE
Merging postcodes in

* Grouping ```df.groupby(key).groups```

* Plotting ```df.hist()```

* Masks
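
For the masks item, one possible illustration is to build each boolean mask once, give it a name, and then combine or negate it under different criteria. A sketch on the first five rows shown earlier; the mask names are hypothetical, and treating an `SC` prefix as "registered in Scotland" is stated here as an assumption about the company number format.

```python
import pandas as pd

sample = pd.DataFrame({
    'CompanyName': [
        'J.C. ROOFING SERVICES LTD',
        'J.C. ROOK & SONS LIMITED',
        'J.C. ROXBURGH & COMPANY LIMITED',
        'J.C. ROXBURGH (INVESTMENTS) LIMITED',
        'J.C. ROXBURGH PROPERTIES LIMITED',
    ],
    'CompanyNumber': ['09176364', '02117042', 'SC041244', 'SC037420', 'SC073801'],
    'RegAddress.PostCode': ['WD25 8BT', 'CT11 7DZ', 'G81 1LQ', 'G83 9LX', 'G81 1LQ'],
})

# Each mask is a boolean Series that can be reused in several selections.
is_scottish = sample['CompanyNumber'].str.startswith('SC')   # assumed meaning of 'SC'
in_g81 = sample['RegAddress.PostCode'].str.startswith('G81')

scottish_in_g81 = sample[is_scottish & in_g81]   # both criteria
scottish_elsewhere = sample[is_scottish & ~in_g81]  # reuse with one mask negated
not_scottish = sample[~is_scottish]              # reuse on its own
```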


### TODO

* Take sample of training sets and commit these?