### How to follow this tutorial

* Go to http://162.13.127.11:8000/
* This will serve up a temporary notebook server that you can use
* Warning: your notebook server will die after 10 minutes of inactivity
* We're going to be processing some CSV and XML data relating to UK businesses

### Imports and magics

First of all, we'll import the libraries we expect to need and call the magic functions we expect to use.

In [1]:
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
from lxml import etree

### Load reference data

Pandas has a rich Input/Output API that works with common formats such as CSV, Excel, SQL and pickled files (and even the clipboard), as well as more specialist formats like HDF5.

Readers exist as top-level functions such as ```pd.read_csv()```, while writers are methods on objects like ```DataFrame```, for example ```df.to_csv()```.

Let's try using ```read_csv``` to load a sample of the list of all the registered businesses in the UK. There is a sample dataset in ```datasets/companies/companies-house.csv```.

In [160]:
df = pd.read_csv('datasets/companies/companies-house.csv')
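A quick way to get a feel for the reader and writer pair is to round-trip a tiny CSV held in memory. The sketch below uses a made-up two-row stand-in for the companies file (the rows are hypothetical), and passes ```dtype=str``` so that leading zeros in company numbers survive. It also shows where columns like ```Unnamed: 0``` tend to come from: writing the index with ```to_csv``` and reading it back as data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the companies file; note the space after the comma
# in the header, which becomes part of the column name ' CompanyNumber'.
csv_text = "CompanyName, CompanyNumber\nACME LTD,01234567\nBETA LTD,SC765432\n"

# dtype=str stops pandas from turning '01234567' into the integer 1234567.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Writing without the index gives a clean round trip.
buffer = io.StringIO()
raw.to_csv(buffer, index=False)
round_tripped = pd.read_csv(io.StringIO(buffer.getvalue()), dtype=str)

# Writing with the default index=True adds an unnamed first column on re-read.
buffer_with_index = io.StringIO()
raw.to_csv(buffer_with_index)
with_index = pd.read_csv(io.StringIO(buffer_with_index.getvalue()), dtype=str)

print(list(raw.columns))
print(list(with_index.columns))
```

If the index has already been written to the file, passing ```index_col=0``` to ```read_csv``` is another way to avoid the extra column.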

### DataFrame basics

Now that we've got our dataframe, what sort of things can we do with it?

In [161]:
# Have a look at the columns we've got
df.columns

Index([u'CompanyName', u' CompanyNumber', u'RegAddress.CareOf',
       u'RegAddress.POBox', u'RegAddress.AddressLine1',
       u' RegAddress.AddressLine2', u'RegAddress.PostTown',
       u'RegAddress.County', u'RegAddress.Country', u'RegAddress.PostCode',
       u'CompanyCategory', u'CompanyStatus', u'CountryOfOrigin',
       u'DissolutionDate', u'IncorporationDate', u'Accounts.AccountRefDay',
       u'Accounts.AccountRefMonth', u'Accounts.NextDueDate',
       u'Accounts.LastMadeUpDate', u'Accounts.AccountCategory',
       u'Returns.NextDueDate', u'Returns.LastMadeUpDate',
       u'Mortgages.NumMortCharges', u'Mortgages.NumMortOutstanding',
       u'Mortgages.NumMortPartSatisfied', u'Mortgages.NumMortSatisfied',
       u'SICCode.SicText_1', u'SICCode.SicText_2', u'SICCode.SicText_3',
       u'SICCode.SicText_4', u'LimitedPartnerships.NumGenPartners',
       u'LimitedPartnerships.NumLimPartners', u'URI',
       u'PreviousName_1.CONDATE', u' PreviousName_1.CompanyName',
       u' Previou

In [55]:
# and their datatypes
df.dtypes

Unnamed: 0                              int64
CompanyName                            object
 CompanyNumber                         object
RegAddress.CareOf                      object
RegAddress.POBox                       object
RegAddress.AddressLine1                object
 RegAddress.AddressLine2               object
RegAddress.PostTown                    object
RegAddress.County                      object
RegAddress.Country                     object
RegAddress.PostCode                    object
CompanyCategory                        object
CompanyStatus                          object
CountryOfOrigin                        object
DissolutionDate                       float64
IncorporationDate                      object
Accounts.AccountRefDay                float64
Accounts.AccountRefMonth              float64
Accounts.NextDueDate                   object
Accounts.LastMadeUpDate                object
Accounts.AccountCategory               object
Returns.NextDueDate               

In [56]:
# Inspect the first few (or last few) rows
df.head()

Unnamed: 0.1,Unnamed: 0,CompanyName,CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1,RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,...,PreviousName_7.CONDATE,PreviousName_7.CompanyName,PreviousName_8.CONDATE,PreviousName_8.CompanyName,PreviousName_9.CONDATE,PreviousName_9.CompanyName,PreviousName_10.CONDATE,PreviousName_10.CompanyName,clean_name,has_trademark
0,50,J.C.B. EQUIPMENT LIMITED,1000710,,,LAKESIDE WORKS,ROCESTER,STAFFS,,,...,,,,,,,,,j.c.b. equipment limited,0
1,175,J.C.W (NW) LIMITED,9028601,,,144 THATTO HEATH ROAD,,ST. HELENS,,,...,,,,,,,,,j.c.w (nw) limited,0
2,225,J.D BARTON LTD,9035149,,,38 HUNTER CLOSE,WILLESBOROUGH,ASHFORD,,,...,,,,,,,,,j.d barton ltd,0
3,252,J.D. AUTOMOTIVE (LUDLOW) LTD,4806642,,,SMITHFIELD CAR PARK,LOWER GALDEFORD,LUDLOW,SHROPSHIRE,,...,,,,,,,,,j.d. automotive (ludlow) ltd,0
4,346,J.D. PHILLIPS & DAUGHTERS LIMITED,4947680,,,LLANDISILIO STORES & POST OFFICE,LLANDISILIO,CLYNDERWEN,PEMBS.,,...,,,,,,,,,j.d. phillips & daughters limited,0


In [57]:
df.tail(10)

Unnamed: 0.1,Unnamed: 0,CompanyName,CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1,RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,...,PreviousName_7.CONDATE,PreviousName_7.CompanyName,PreviousName_8.CONDATE,PreviousName_8.CompanyName,PreviousName_9.CONDATE,PreviousName_9.CompanyName,PreviousName_10.CONDATE,PreviousName_10.CompanyName,clean_name,has_trademark
11043,849165,PIPEWORKS MECHANICAL SERVICES (SOUTH EAST) LIM...,06506556,,,3 THE OLD PRINT HOUSE,RUSSELL STREET,DOVER,KENT,,...,,,,,,,,,pipeworks mechanical services (south east) lim...,0
11044,849265,PIPLAIN DESIGN UK LIMITED,08742811,,,"HALE HOUSE, UNIT 5",296A GREEN LANES,PALMERS GREEN,LONDON,,...,,,,,,,,,piplain design uk limited,0
11045,849316,PIPPA ROLLS DESIGNS LIMITED,07755892,,,270 COLCHESTER ROAD,,IPSWICH,,,...,,,,,,,,,pippa rolls designs limited,0
11046,849352,PIPPIN BUILDING PRODUCTS LIMITED,04573772,,,8 NEW FIELDS,2 STINSFORD ROAD NUFFIELD,POOLE,DORSET,,...,,,,,,,,,pippin building products limited,0
11047,849689,PIRBUDAK LIMITED,08824795,,,834 HERTFORD ROAD,,ENFIELD,MIDDLESEX,,...,,,,,,,,,pirbudak limited,0
11048,849732,PIRI FINO FRANCHISE LIMITED,08273169,,,573 CHESTER ROAD,,SUTTON COLDFIELD,W MIDLANDS,,...,,,,,,,,,piri fino franchise limited,0
11049,849793,PIRNIE FARMS,SL000893,,,PIRNIE,KELSO,,,,...,,,,,,,,,pirnie farms,0
11050,849827,PIRRIE DESIGN CAPITAL LTD.,08742539,,,209 CRANMER COURT,WHITEHEADS GROVE,LONDON,,,...,,,,,,,,,pirrie design capital ltd.,0
11051,849880,PIS CONTRACTING LTD,09642072,,,BRANSTON COURT,BRANSTON STREET,BIRMINGHAM,,UNITED KINGDOM,...,,,,,,,,,pis contracting ltd,0
11052,849984,PISGAH PARTNERS LIMITED,08631383,,,4 FITZGERALD AVENUE,,LONDON,,,...,,,,,,,,,pisgah partners limited,0


In [59]:
# Get our hands on the raw data
df.values

array([[50, 'J.C.B. EQUIPMENT LIMITED', '01000710', ..., nan,
        'j.c.b. equipment limited', 0],
       [175, 'J.C.W (NW) LIMITED', '09028601', ..., nan,
        'j.c.w (nw) limited', 0],
       [225, 'J.D BARTON LTD', '09035149', ..., nan, 'j.d barton ltd', 0],
       ..., 
       [849827, 'PIRRIE DESIGN CAPITAL LTD.', '08742539', ..., nan,
        'pirrie design capital ltd.', 0],
       [849880, 'PIS CONTRACTING LTD', '09642072', ..., nan,
        'pis contracting ltd', 0],
       [849984, 'PISGAH PARTNERS LIMITED', '08631383', ..., nan,
        'pisgah partners limited', 0]], dtype=object)

In [60]:
type(df.values)

numpy.ndarray

In [62]:
# Query to see how big our data frame is
df.shape

(11053, 56)

In [63]:
# Quickly assess numerical data
df['Mortgages.NumMortCharges'].describe()

count    11053.000000
mean         0.779155
std          6.609417
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        312.000000
Name: Mortgages.NumMortCharges, dtype: float64

In [68]:
df['Mortgages.NumMortCharges'].sum()

8612

In [69]:
# And non-numerical data
df['SICCode.SicText_1'].describe()

count             11053
unique              556
top       None Supplied
freq               2477
Name: SICCode.SicText_1, dtype: object

In [70]:
# We can also call this on a frame as a whole and specify which types to include
df.describe(include=['object'])

Unnamed: 0,CompanyName,CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1,RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,...,PreviousName_2.CompanyName,PreviousName_3.CONDATE,PreviousName_3.CompanyName,PreviousName_4.CONDATE,PreviousName_4.CompanyName,PreviousName_5.CONDATE,PreviousName_5.CompanyName,PreviousName_6.CONDATE,PreviousName_6.CompanyName,clean_name
count,11053,11053,469,14,11022,7539,10738,5709,2565,10945,...,237,53,53,15,15,3,3,2,2,11053
unique,11053,11053,427,14,9489,5451,1766,368,25,9182,...,237,52,53,15,15,3,3,2,2,11053
top,LONDON HOLIDAY SERVICES LIMITED,764426,NEW WAVE ACCOUNTING,46,145-157 ST JOHN STREET,LONDON,LONDON,SURREY,ENGLAND,EC1V 4PW,...,PRECIS (2464) LIMITED,30/01/1998,BITS AND PIXELS LTD,16/08/1996,"DONALD HEALEY MOTOR COMPANY,LIMITED",09/11/2010,EVER 1199 LIMITED,24/02/1999,SOUTH WALES RADIO LIMITED,olsense limited
freq,1,1,7,1,46,131,1937,326,1274,52,...,1,2,1,1,1,1,1,1,1,1


In [73]:
# Each Company Number is unique..
df[' CompanyNumber'].nunique()

11053
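Some of the column names in this file carry a leading space (```' CompanyNumber'```, for example), which is easy to trip over. One option, sketched here on a hypothetical one-row frame, is to strip whitespace from every column name up front.

```python
import pandas as pd

# Hypothetical frame that mimics the stray leading space in the header.
frame = pd.DataFrame({'CompanyName': ['ACME LTD'], ' CompanyNumber': ['01234567']})

# rename() with a function applies it to each column label and returns a new frame.
cleaned = frame.rename(columns=lambda name: name.strip())

print(list(cleaned.columns))
```

The rest of this notebook keeps the original names, so any cleaned-up frame would need its lookups changing to match.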

### Data types

The main pandas datatypes are ```float```, ```int```, ```bool```, ```datetime64```, ```timedelta``` and ```object```. ```float``` and ```int``` also have 32 and 64 bit variants.

Let's take a look at the datatypes in our dataframe again.

In [75]:
df.dtypes

Unnamed: 0                              int64
CompanyName                            object
 CompanyNumber                         object
RegAddress.CareOf                      object
RegAddress.POBox                       object
RegAddress.AddressLine1                object
 RegAddress.AddressLine2               object
RegAddress.PostTown                    object
RegAddress.County                      object
RegAddress.Country                     object
RegAddress.PostCode                    object
CompanyCategory                        object
CompanyStatus                          object
CountryOfOrigin                        object
DissolutionDate                       float64
IncorporationDate                      object
Accounts.AccountRefDay                float64
Accounts.AccountRefMonth              float64
Accounts.NextDueDate                   object
Accounts.LastMadeUpDate                object
Accounts.AccountCategory               object
Returns.NextDueDate               

In [76]:
df['IncorporationDate'] = pd.to_datetime(df['IncorporationDate'])
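The ```describe``` output earlier shows dates such as ```30/01/1998```, which suggests the file stores dates day-first. Without a hint, ```pd.to_datetime``` may read an ambiguous value like ```05/07/2014``` month-first, so it can be safer to spell out the format. This is a minimal sketch on made-up strings, assuming the ```%d/%m/%Y``` layout holds for the whole column.

```python
import pandas as pd

# Hypothetical day-first date strings in the style seen in the dataset.
dates = pd.Series(['25/01/1971', '05/07/2014', '12/05/2014'])

# An explicit format removes any guessing about day/month order.
parsed = pd.to_datetime(dates, format='%d/%m/%Y')

print(parsed.dt.month.tolist())
```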

In [78]:
df.dtypes

Unnamed: 0                                     int64
CompanyName                                   object
 CompanyNumber                                object
RegAddress.CareOf                             object
RegAddress.POBox                              object
RegAddress.AddressLine1                       object
 RegAddress.AddressLine2                      object
RegAddress.PostTown                           object
RegAddress.County                             object
RegAddress.Country                            object
RegAddress.PostCode                           object
CompanyCategory                               object
CompanyStatus                                 object
CountryOfOrigin                               object
DissolutionDate                              float64
IncorporationDate                     datetime64[ns]
Accounts.AccountRefDay                       float64
Accounts.AccountRefMonth                     float64
Accounts.NextDueDate                          

In [79]:
df['IncorporationDate'].describe()

count                   11053
unique                   4467
top       2015-01-19 00:00:00
freq                       18
first     1864-01-04 00:00:00
last      2015-12-06 00:00:00
Name: IncorporationDate, dtype: object

### Selecting rows and columns

How can we slice and dice with pandas?

```[]``` generally returns a lower-dimensional data structure.

In [80]:
type(df['CompanyName'])

pandas.core.series.Series

In [82]:
names = df['CompanyName']
type(names[0])

str

In [84]:
# We can also pass a list of columns
type(df[['CompanyName', 'IncorporationDate']])

pandas.core.frame.DataFrame

In [85]:
df[['CompanyName', 'IncorporationDate']].head(2)

Unnamed: 0,CompanyName,IncorporationDate
0,J.C.B. EQUIPMENT LIMITED,1971-01-25
1,J.C.W (NW) LIMITED,2014-07-05


We can also access columns in DataFrames as an attribute, though this comes with a few caveats.

In [86]:
df.CompanyName.head()

0             J.C.B. EQUIPMENT LIMITED
1                   J.C.W (NW) LIMITED
2                       J.D BARTON LTD
3         J.D. AUTOMOTIVE (LUDLOW) LTD
4    J.D. PHILLIPS & DAUGHTERS LIMITED
Name: CompanyName, dtype: object

In [88]:
df.Mortgages.NumMortCharges

AttributeError: 'DataFrame' object has no attribute 'Mortgages'

Try using tab-complete...
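The caveat here is that attribute access only works when the column name is a valid Python identifier. A name containing a dot is read as a chain of attribute lookups, so it fails. The sketch below reproduces this on a hypothetical one-row frame and falls back to ```[]```, which works for any column name.

```python
import pandas as pd

# Hypothetical frame with one plain and one dotted column name.
frame = pd.DataFrame({'CompanyName': ['ACME LTD'], 'Mortgages.NumMortCharges': [3]})

try:
    # Python parses this as (frame.Mortgages).NumMortCharges
    frame.Mortgages.NumMortCharges
    attribute_access_worked = True
except AttributeError:
    attribute_access_worked = False

# Bracket access takes the full column name as a string.
charges = frame['Mortgages.NumMortCharges']

print(attribute_access_worked, charges.tolist())
```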

A simple way of slicing rows is to use ```[]```. This works in much the same way as standard Python slicing.

How would you access the first row in a DataFrame?

In [91]:
df[:1]

Unnamed: 0.1,Unnamed: 0,CompanyName,CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1,RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,...,PreviousName_7.CONDATE,PreviousName_7.CompanyName,PreviousName_8.CONDATE,PreviousName_8.CompanyName,PreviousName_9.CONDATE,PreviousName_9.CompanyName,PreviousName_10.CONDATE,PreviousName_10.CompanyName,clean_name,has_trademark
0,50,J.C.B. EQUIPMENT LIMITED,1000710,,,LAKESIDE WORKS,ROCESTER,STAFFS,,,...,,,,,,,,,j.c.b. equipment limited,0


How would you access all but the last row in a DataFrame?

In [100]:
df.tail(3)

Unnamed: 0.1,Unnamed: 0,CompanyName,CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1,RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,...,PreviousName_7.CONDATE,PreviousName_7.CompanyName,PreviousName_8.CONDATE,PreviousName_8.CompanyName,PreviousName_9.CONDATE,PreviousName_9.CompanyName,PreviousName_10.CONDATE,PreviousName_10.CompanyName,clean_name,has_trademark
11050,849827,PIRRIE DESIGN CAPITAL LTD.,8742539,,,209 CRANMER COURT,WHITEHEADS GROVE,LONDON,,,...,,,,,,,,,pirrie design capital ltd.,0
11051,849880,PIS CONTRACTING LTD,9642072,,,BRANSTON COURT,BRANSTON STREET,BIRMINGHAM,,UNITED KINGDOM,...,,,,,,,,,pis contracting ltd,0
11052,849984,PISGAH PARTNERS LIMITED,8631383,,,4 FITZGERALD AVENUE,,LONDON,,,...,,,,,,,,,pisgah partners limited,0


In [124]:
print(df.shape[0])
print(df[:df.shape[0]-1].shape[0])
df[:df.shape[0]-1].tail(2)

11053
11052


Unnamed: 0.1,Unnamed: 0,CompanyName,CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1,RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,...,PreviousName_7.CONDATE,PreviousName_7.CompanyName,PreviousName_8.CONDATE,PreviousName_8.CompanyName,PreviousName_9.CONDATE,PreviousName_9.CompanyName,PreviousName_10.CONDATE,PreviousName_10.CompanyName,clean_name,has_trademark
11050,849827,PIRRIE DESIGN CAPITAL LTD.,8742539,,,209 CRANMER COURT,WHITEHEADS GROVE,LONDON,,,...,,,,,,,,,pirrie design capital ltd.,0
11051,849880,PIS CONTRACTING LTD,9642072,,,BRANSTON COURT,BRANSTON STREET,BIRMINGHAM,,UNITED KINGDOM,...,,,,,,,,,pis contracting ltd,0
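Since ```[]``` follows standard Python slicing, a negative stop does the same job without looking up the row count. A small sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical three-row frame.
frame = pd.DataFrame({'CompanyName': ['A LTD', 'B LTD', 'C LTD']})

# Everything except the last row, just like a list slice.
all_but_last = frame[:-1]

# The explicitly positional spelling gives the same rows.
same_by_position = frame.iloc[:-1]

print(all_but_last['CompanyName'].tolist())
```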


We can also introduce a step parameter

In [114]:
df[:20:10]

Unnamed: 0.1,Unnamed: 0,CompanyName,CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1,RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,...,PreviousName_7.CONDATE,PreviousName_7.CompanyName,PreviousName_8.CONDATE,PreviousName_8.CompanyName,PreviousName_9.CONDATE,PreviousName_9.CompanyName,PreviousName_10.CONDATE,PreviousName_10.CompanyName,clean_name,has_trademark
0,50,J.C.B. EQUIPMENT LIMITED,1000710,,,LAKESIDE WORKS,ROCESTER,STAFFS,,,...,,,,,,,,,j.c.b. equipment limited,0
10,722,J.E.K. ASSOCIATES LIMITED,6897464,,,6 TIPPENDELL LANE,CHISWELL GREEN,ST ALBANS,HERTFORDSHIRE,,...,,,,,,,,,j.e.k. associates limited,0


How would we take every third company name of the first 10 companies?

i.e. the 3rd, 6th and 9th

In [121]:
df[:10].CompanyName

0                      J.C.B. EQUIPMENT LIMITED
1                            J.C.W (NW) LIMITED
2                                J.D BARTON LTD
3                  J.D. AUTOMOTIVE (LUDLOW) LTD
4             J.D. PHILLIPS & DAUGHTERS LIMITED
5                            J.D. WYATT LIMITED
6              J.D.H. BUSINESS SERVICES LIMITED
7                           J.E. PORTER LIMITED
8    J.E. STRUTT FARMING (PLANT & HIRE) LIMITED
9                      J.E.BALLARD & CO.LIMITED
Name: CompanyName, dtype: object

In [122]:
df[2:10:3].CompanyName

2                                J.D BARTON LTD
5                            J.D. WYATT LIMITED
8    J.E. STRUTT FARMING (PLANT & HIRE) LIMITED
Name: CompanyName, dtype: object

pandas allows selection by label, position, index and advanced hierarchical indexing.

The devil is in the detail, and it's easy to get into a mess.

```df.loc``` is used for selection via the index and can take a label, a list of labels, a slice of labels or a boolean array. The rules governing which elements are inclusive or exclusive may differ from what you might expect.

In [133]:
# Take the row with the index label 0 - this is NOT generally the 0th element
df.loc[0].CompanyName

'J.C.B. EQUIPMENT LIMITED'

```.loc``` can also take a list of index labels. How would we take every other element, up to the 10th, starting at the 2nd?

In [134]:
df.loc[range(1,10,2)].CompanyName

1              J.C.W (NW) LIMITED
3    J.D. AUTOMOTIVE (LUDLOW) LTD
5              J.D. WYATT LIMITED
7             J.E. PORTER LIMITED
9        J.E.BALLARD & CO.LIMITED
Name: CompanyName, dtype: object

```.loc``` can also take a slice object. Both the stop and the start are included. How would we take the first 6 company names?

In [136]:
df.loc[0:5].CompanyName

0             J.C.B. EQUIPMENT LIMITED
1                   J.C.W (NW) LIMITED
2                       J.D BARTON LTD
3         J.D. AUTOMOTIVE (LUDLOW) LTD
4    J.D. PHILLIPS & DAUGHTERS LIMITED
5                   J.D. WYATT LIMITED
Name: CompanyName, dtype: object

In [138]:
# Compare this to
df[0:5].CompanyName

0             J.C.B. EQUIPMENT LIMITED
1                   J.C.W (NW) LIMITED
2                       J.D BARTON LTD
3         J.D. AUTOMOTIVE (LUDLOW) LTD
4    J.D. PHILLIPS & DAUGHTERS LIMITED
Name: CompanyName, dtype: object
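With the default 0, 1, 2, ... index, labels and positions happen to coincide, which hides the difference. The sketch below uses a hypothetical frame whose index labels are 10, 20, 30 and 40 to make the two behaviours visible: ```.loc``` slices by label and includes the stop, while ```.iloc``` and plain ```[]``` slices work by position and exclude it.

```python
import pandas as pd

# Hypothetical frame with index labels that differ from row positions.
frame = pd.DataFrame(
    {'CompanyName': ['A LTD', 'B LTD', 'C LTD', 'D LTD']},
    index=[10, 20, 30, 40],
)

# Label-based: rows labelled 10, 20 and 30 (the stop label is included).
by_label = frame.loc[10:30, 'CompanyName']

# Position-based: rows at positions 0, 1 and 2 (the stop position is excluded).
by_position = frame.iloc[0:3]['CompanyName']

# A plain integer slice with [] is also positional.
plain_slice = frame[0:2]['CompanyName']

print(by_label.tolist(), plain_slice.tolist())
```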

We can also pass in boolean arrays.

In [141]:
df.loc[[True, False, True]].CompanyName

0    J.C.B. EQUIPMENT LIMITED
2              J.D BARTON LTD
Name: CompanyName, dtype: object
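Newer pandas releases may reject a boolean list that is shorter than the frame, so in practice a boolean array is usually built from the frame itself, which guarantees one entry per row. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical frame with a numeric column to filter on.
frame = pd.DataFrame({
    'CompanyName': ['A LTD', 'B LTD', 'C LTD'],
    'Mortgages.NumMortCharges': [0, 2, 5],
})

# Comparing a column yields a boolean Series aligned with the frame's index.
has_charges = frame['Mortgages.NumMortCharges'] > 0

# .loc keeps the rows where the mask is True.
selected = frame.loc[has_charges, 'CompanyName']

print(selected.tolist())
```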

We'll look at Boolean arrays in more detail later when we talk about Masks. For now, it's worth pointing out that you can also pass in column labels to ```.loc```.

In [143]:
df.loc[0:5, ['CompanyName', 'IncorporationDate']]

Unnamed: 0,CompanyName,IncorporationDate
0,J.C.B. EQUIPMENT LIMITED,1971-01-25
1,J.C.W (NW) LIMITED,2014-07-05
2,J.D BARTON LTD,2014-12-05
3,J.D. AUTOMOTIVE (LUDLOW) LTD,2003-06-20
4,J.D. PHILLIPS & DAUGHTERS LIMITED,2003-10-30
5,J.D. WYATT LIMITED,2003-04-29


```.iloc``` allows selection by position. Most of this is pretty intuitive.

In [162]:
df.iloc[0, 0]

'J.C.B. EQUIPMENT LIMITED'

In [164]:
df.iloc[:5, [0,1]]

Unnamed: 0,CompanyName,CompanyNumber
0,J.C.B. EQUIPMENT LIMITED,1000710
1,J.C.W (NW) LIMITED,9028601
2,J.D BARTON LTD,9035149
3,J.D. AUTOMOTIVE (LUDLOW) LTD,4806642
4,J.D. PHILLIPS & DAUGHTERS LIMITED,4947680


In [167]:
df.iloc[:, 0:5].head()

Unnamed: 0,CompanyName,CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1
0,J.C.B. EQUIPMENT LIMITED,1000710,,,LAKESIDE WORKS
1,J.C.W (NW) LIMITED,9028601,,,144 THATTO HEATH ROAD
2,J.D BARTON LTD,9035149,,,38 HUNTER CLOSE
3,J.D. AUTOMOTIVE (LUDLOW) LTD,4806642,,,SMITHFIELD CAR PARK
4,J.D. PHILLIPS & DAUGHTERS LIMITED,4947680,,,LLANDISILIO STORES & POST OFFICE


In [169]:
# Pandas is fairly forgiving when accessing out-of-range slices.
df.shape[0]

11053

In [171]:
df['CompanyName'].tail(5)

11048    PIRI FINO FRANCHISE LIMITED
11049                   PIRNIE FARMS
11050     PIRRIE DESIGN CAPITAL LTD.
11051            PIS CONTRACTING LTD
11052        PISGAH PARTNERS LIMITED
Name: CompanyName, dtype: object

In [170]:
df.iloc[df.shape[0]-4:df.shape[0]+4, 0]

11049                  PIRNIE FARMS
11050    PIRRIE DESIGN CAPITAL LTD.
11051           PIS CONTRACTING LTD
11052       PISGAH PARTNERS LIMITED
Name: CompanyName, dtype: object
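The forgiveness applies to slices only. Asking for a single position that does not exist still raises an ```IndexError```, as this sketch on a hypothetical three-element Series shows.

```python
import pandas as pd

# Hypothetical three-element Series.
names = pd.Series(['A LTD', 'B LTD', 'C LTD'])

# A slice that runs past the end is silently truncated.
beyond_end = names.iloc[1:10]

try:
    # A single out-of-range position is an error.
    names.iloc[10]
    scalar_lookup_failed = False
except IndexError:
    scalar_lookup_failed = True

print(beyond_end.tolist(), scalar_lookup_failed)
```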

Finally ```.ix``` is more general and allows inputs that both ```.iloc``` and ```.loc``` would expect. It comes into its own when dealing with hierarchical indexes that are of mixed types (i.e. positional and pure label based). Note that ```.ix``` was deprecated in pandas 0.20 and removed in pandas 1.0.

I tend to stick to ```.loc``` and avoid hierarchical indexes.
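Mixed selections (rows by position, columns by label) can still be written with ```.loc``` or ```.iloc``` alone by translating one side first. This is a sketch on a hypothetical frame, not the only way to do it: ```frame.index[:2]``` turns positions into labels, and ```columns.get_loc``` turns a label into a position.

```python
import pandas as pd

# Hypothetical frame with string index labels.
frame = pd.DataFrame(
    {'CompanyName': ['A LTD', 'B LTD', 'C LTD'], 'has_trademark': [0, 1, 0]},
    index=['x', 'y', 'z'],
)

# Positions -> labels, then select with .loc
first_two_names = frame.loc[frame.index[:2], 'CompanyName']

# Label -> position, then select with .iloc
same_by_position = frame.iloc[:2, frame.columns.get_loc('CompanyName')]

print(first_two_names.tolist())
```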

One other function that is worth mentioning is ```.sample()```.

In [172]:
df['CompanyName'].sample(n=10, replace=False, weights=np.random.rand(df.shape[0]))

10648                          PERSPECTIVE TRADING LIMITED
1889     KINGSTON BUSINESS DEVELOPMENT SERVICES (U.K.) ...
4138                            LV POWER SOLUTIONS LIMITED
963                           JS INVESTMENT MANAGEMENT LTD
3835                            LONDON WINGS CORPORATE LTD
10158                              PATCH SOLUTIONS LIMITED
1482                        KEBLE COURT (STAMFORD) LIMITED
824                            JOINT VENTURE CONTRACTS LTD
9843                    PALLADIUM CLOTHING COMPANY LIMITED
9500                           OVERLOAD MANAGEMENT LIMITED
Name: CompanyName, dtype: object

How would we generate a sample, favouring those that have mortgages?

In [173]:
df[['CompanyName', 'Mortgages.NumMortCharges']].sample(n=10, 
                                                       weights=df['Mortgages.NumMortCharges'])

Unnamed: 0,CompanyName,Mortgages.NumMortCharges
7188,MULTIGRAPHICS LIMITED,15
5173,MARTON COUNTRY CLUB LIMITED,7
6890,MORTLAKE UNDERWRITING LIMITED,128
7531,NAMECO (NO.804) LIMITED,70
4952,MAREHAY GARAGE LIMITED,2
2199,KRNS PROPERTIES LTD,10
5947,METNOR PROPERTY GROUP LIMITED,10
810,JOHNSTON BERRY LTD,33
1439,KBC PUBLIC RELATIONS & MARKETING LTD.,4
4016,LS CLAYTON SQUARE LIMITED,11
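Two details of weighted sampling are worth a quick sketch: rows with a weight of zero are never drawn, and passing ```random_state``` makes the draw repeatable. The frame below is hypothetical, and for a DataFrame the weights can be given as a column name.

```python
import pandas as pd

# Hypothetical frame; two companies have no mortgage charges.
frame = pd.DataFrame({
    'CompanyName': ['A LTD', 'B LTD', 'C LTD', 'D LTD', 'E LTD'],
    'Mortgages.NumMortCharges': [0, 0, 3, 10, 1],
})

# Weight rows by the named column and fix the seed for repeatability.
picked = frame.sample(n=2, weights='Mortgages.NumMortCharges', random_state=42)
picked_again = frame.sample(n=2, weights='Mortgages.NumMortCharges', random_state=42)

print(picked['CompanyName'].tolist())
```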


### Flow

* Unzip etc using magics out to command line
* Read CSV as ASCII, show encoding errors
* Read as UTF8, show has dealt with them correctly
* Normalise names?
* Normalise eg postcode/split postcodes?
* Do some aggregations and grouping?
* Masks - ensure use different criteria to show usefulness of reusability
* Could we do something with PDF here? HTML pages might also be useful (download, unzip, process HTML)

### Preprocess and load reference data

In [15]:
#!unzip '/var/data/s2ds/companies/*.zip' -d /var/data/s2ds/companies/

In [9]:
def read_files(files):
    dfs = []

    for f in files:
        df = pd.read_csv(f)
        dfs.append(df)
        #dfs.append(df[['CompanyName', ' CompanyNumber', 'RegAddress.PostCode']])

    return pd.concat(dfs)

In [17]:
def read_files_utf8(files):
    dfs = []

    for f in files:
        df = pd.read_csv(f, encoding='utf8')
        dfs.append(df[['CompanyName', ' CompanyNumber', 'RegAddress.PostCode']])

    return pd.concat(dfs)
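The reason for the ```encoding``` argument can be seen on a tiny in-memory file. The sketch below encodes a hypothetical one-row CSV containing an accented character as UTF-8, then reads it back twice: declaring it as ASCII fails on the accented byte, while declaring it as UTF-8 decodes it correctly.

```python
import io

import pandas as pd

# Hypothetical CSV content with a non-ASCII character, stored as UTF-8 bytes.
raw_bytes = "CompanyName\nP\u00c9AN & ZAKIAN LLP\n".encode('utf-8')

try:
    # The accented character is not valid ASCII, so decoding fails.
    pd.read_csv(io.BytesIO(raw_bytes), encoding='ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Declaring the right encoding reads the name intact.
decoded = pd.read_csv(io.BytesIO(raw_bytes), encoding='utf-8')

print(ascii_failed, decoded['CompanyName'].tolist())
```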

In [3]:
files = !ls /var/data/s2ds/companies/*_5.csv

In [7]:
files = ['/var/data/s2ds/companies/BasicCompanyData-2015-07-01-part3_5.csv']

In [10]:
df = read_files(files)

In [29]:
reg_to_keep = ['02334804']

In [13]:
df['clean_name'] = df['CompanyName'].str.lower()

### Preprocess and load UK Intellectual Property Office (IPO) data

In [14]:
tree = etree.parse('/var/data/s2ds/trademarks/jnl.xml')

In [15]:
doc = tree.getroot()

In [16]:
applicants = [a.text.lower() for a in doc.findall('.//ApplicantName')]
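One thing to watch with this kind of list comprehension is that an empty element has ```text``` set to ```None```, so calling ```.lower()``` on it would fail. Here is a hedged sketch of a more defensive version. It uses the standard library's ```xml.etree.ElementTree``` so that it runs without extra installs (```lxml.etree``` offers the same ```findall``` call), and the XML layout apart from ```ApplicantName``` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical journal fragment; only the ApplicantName tag comes from the notebook.
xml_text = """<Journal>
  <Mark><ApplicantName>Floreana Ltd</ApplicantName></Mark>
  <Mark><ApplicantName/></Mark>
  <Mark><ApplicantName>Peter John Savage</ApplicantName></Mark>
</Journal>"""

doc = ET.fromstring(xml_text)

# Skip elements with no text, then normalise the rest to lower case.
applicants = [
    a.text.strip().lower()
    for a in doc.findall('.//ApplicantName')
    if a.text
]

print(applicants)
```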

In [17]:
trademarks = pd.DataFrame({'applicants': applicants})

In [18]:
df['has_trademark'] = 0

In [19]:
df.loc[df.clean_name.isin(trademarks.applicants), 'has_trademark'] = 1
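The two steps above (initialise to 0, then set matching rows to 1) can also be collapsed into one, since ```isin``` already returns a boolean Series. A small sketch on hypothetical names:

```python
import pandas as pd

# Hypothetical company names and trademark applicants.
companies = pd.DataFrame({
    'clean_name': ['floreana ltd', 'acme ltd', 'step2progress limited'],
})
applicants = pd.Series(['floreana ltd', 'step2progress limited', 'peter john savage'])

# True/False per row, converted to 1/0.
companies['has_trademark'] = companies['clean_name'].isin(applicants).astype(int)

print(companies['has_trademark'].tolist())
```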

In [30]:
reg_to_keep.extend(df[df.has_trademark == 1][' CompanyNumber'].tolist())

In [31]:
reg_to_keep

['02334804',
 '03608956',
 '03125118',
 '00713656',
 '09013389',
 '02360971',
 '09423212',
 '08817151',
 '08934484',
 '09600978',
 '06781096',
 '08592122',
 '02029122',
 '01417162',
 '08525270',
 '03035047',
 '09226081',
 '08648114',
 '07412962',
 '02075721',
 '08603549',
 'SC247423',
 '08575347',
 '05551430',
 '09514380',
 '09445835',
 '05331534',
 '07146235',
 '09455861',
 '09242421',
 '09355175',
 '09405019',
 '00679848',
 '09328054',
 '05738790',
 '01700202',
 '01116428',
 '04768495',
 '09608143',
 '06273411',
 '07363903',
 '02012744',
 '09047353',
 '01579311',
 '09515476',
 '09014991',
 '09306540',
 '08872718',
 '06572441',
 '06074095',
 '08504449',
 '08274220',
 '05094961',
 '05595291',
 '06014528',
 '07319695',
 '06497115',
 '06029008',
 '01576542',
 '07768433',
 '08769970',
 '09586836',
 '07946659']

In [41]:
import random

reg_to_keep.extend(df.ix[random.sample(df.index, 10000)][' CompanyNumber'].tolist())


In [42]:
reg_to_keep = list(set(reg_to_keep))

In [43]:
len(reg_to_keep)

11053

In [46]:
df[df[' CompanyNumber'].isin(reg_to_keep)].to_csv('datasets/companies/companies-house.csv')

In [106]:
trademarks = pd.merge(trademarks, df, how='left', left_on='applicants', right_on='clean_name')

In [107]:
trademarks.head()

Unnamed: 0,applicants,CompanyName,CompanyNumber,RegAddress.PostCode,clean_name
0,barefaced skincare limited,BAREFACED SKINCARE LIMITED,09525777,OX26 4LD,barefaced skincare limited
1,floreana ltd,FLOREANA LTD,SC432789,EH6 5SD,floreana ltd
2,floreana ltd,FLOREANA LTD,SC432789,EH6 5SD,floreana ltd
3,step2progress limited,STEP2PROGRESS LIMITED,09301218,SY1 3AF,step2progress limited
4,peter john savage,,,,


In [109]:
trademarks.clean_name.isnull().sum()

601
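Counting nulls in a right-hand column works, but ```pd.merge``` can also report directly where each row came from via ```indicator=True```, which adds a ```_merge``` column. A minimal sketch on hypothetical frames:

```python
import pandas as pd

# Hypothetical applicants and companies; only one applicant has a match.
trademarks = pd.DataFrame({'applicants': ['floreana ltd', 'peter john savage']})
companies = pd.DataFrame({
    'clean_name': ['floreana ltd'],
    'CompanyName': ['FLOREANA LTD'],
})

merged = pd.merge(
    trademarks, companies,
    how='left', left_on='applicants', right_on='clean_name',
    indicator=True,  # adds '_merge' with values 'both' or 'left_only' for a left join
)

# Rows that found no company on the right-hand side.
unmatched = int((merged['_merge'] == 'left_only').sum())

print(unmatched)
```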

In [111]:
trademarks[trademarks.clean_name.isnull()].head(10)

Unnamed: 0,applicants,CompanyName,CompanyNumber,RegAddress.PostCode,clean_name
4,peter john savage,,,,
5,vincent dassault,,,,
6,ian redman,,,,
7,turbomed orthotics inc.,,,,
9,marek ryan larwood,,,,
17,péan & zakian llp,,,,
18,paul karaiskos,,,,
19,aran osman,,,,
20,barrington brown,,,,
21,joe bearmen,,,,


In [112]:
trademarks = trademarks[trademarks.clean_name.notnull()]

In [115]:
trademarks['RegAddress.PostCode'].str.startswith("KT").sum()

13
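If the postcode column contains missing values, ```str.startswith``` returns a missing value for those rows, which gets in the way when the result is used as a mask. Passing ```na=False``` treats them as non-matches. A short sketch on hypothetical postcodes:

```python
import numpy as np
import pandas as pd

# Hypothetical postcodes, one of them missing.
postcodes = pd.Series(['KT1 2AB', 'EH6 5SD', np.nan, 'KT22 7XX'])

# na=False makes missing postcodes count as "does not start with KT".
kt_mask = postcodes.str.startswith('KT', na=False)
kt_count = int(kt_mask.sum())

print(kt_count)
```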

### TODO

* Take sample of training sets and commit these?