  # Data Cleaning with Pandas #

Welcome to your introduction to Pandas and Jupyter Notebook!

Today we're going to learn how to read in a CSV file, create a dataframe, identify the different ways it may be dirty, and learn some techniques for cleaning up our data set.

karrie.anne.kehoe@gmail.com/@karriekehoe

## Getting to grips with Jupyter Notebook and Pandas

Jupyter Notebook is an interactive, browser-based programming environment. It can be used with multiple programming languages, for writing documentation, and for visualising data. If you want to learn more about what Jupyter Notebook can do, you can read its documentation at http://jupyter-notebook.readthedocs.io/en/latest/notebook.html

Pandas is a Python library designed for data analysis. It's very flexible, easy to use and has a range of useful built-in functions. If you want to learn more about what Pandas can do, you can read its documentation at http://pandas.pydata.org/ or browse the tutorials and cookbook at http://pandas.pydata.org/pandas-docs/version/0.18.1/tutorials.html

### Shortcuts:
* `esc` - takes you into command mode
* `a` - insert cell above
* `b` - insert cell below
* `shift` + `tab` - shows you the documentation for your code
* `shift` + `enter` - runs your cell
* `d` `d` (press `d` twice) - deletes a cell

### Terminology ###

**Dataframe** - a dataframe is a two dimensional tabular data structure with labeled axes

**Series** - a series is similar to a list, array or a single column within a dataframe
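
To make the two terms concrete, here is a minimal sketch using made-up data. The column names and values are invented for illustration and are not taken from the workshop file; the `import pandas as pd` line is explained in the next section.

```python
import pandas as pd

# A tiny dataframe built from a dictionary: keys become column names.
# The names and amounts here are made up for illustration.
toy = pd.DataFrame({
    'Donor': ['Ann', 'Bob', 'Cat'],
    'Amount': [100.0, 250.5, 75.0],
})

# Selecting a single column gives back a Series.
amounts = toy['Amount']

print(type(toy).__name__)      # DataFrame
print(type(amounts).__name__)  # Series
print(toy.shape)               # (3, 2): three rows, two columns
```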

## Starting off

First we need to import Pandas, our Python library. To do so we use the line of code below. We use 'pd' as an alias to make it easier when typing in our code.
We are going to type:

`import pandas as pd`

In [165]:
import pandas as pd

Now we create a dataframe and read in our csv.

`df = pd.read_csv('filepath')`

In [166]:
df = pd.read_csv('/Users/karrie/Desktop/DATA TRAINING/CIJ 2017/results.csv')

Let's look at the first ten rows of our data

`df.head(10)`

In [167]:
df.head(10)

Unnamed: 0,﻿ECRef,RegulatedEntityName,RegulatedEntityType,Value,AcceptedDate,AccountingUnitName,DonorName,AccountingUnitsAsCentralParty,IsSponsorship,DonorStatus,...,ReceivedDate,ReportedDate,IsReportedPrePoll,ReportingPeriodName,IsBequest,IsAggregation,RegulatedEntityId,AccountingUnitId,DonorId,CampaigningName
0,C0317788,The Rt Hon David Mundell MP,Regulated Donee,"£2,500.00",01/04/2017,,Kirklee Property Company (2) Ltd,False,False,Company,...,01/04/2017,07/04/2017,,April 2017,False,False,1491,,74251,
1,NC0317561,UK Independence Party (UKIP),Political Party,"£2,645.00",31/03/2017,Mansfield,Arromax Structures Ltd,False,False,Company,...,31/03/2017,25/04/2017,,Q1 2017,False,False,85,4928.0,68148,
2,C0317662,Liberal Democrats,Political Party,"£2,858.20",31/03/2017,"Rugby, Nuneaton and North Warwickshire",Rugby Lib Dem Council Group,False,False,Unincorporated Association,...,26/03/2017,29/04/2017,,Q1 2017,False,True,90,5276.0,43044,
3,C0317679,Liberal Democrats,Political Party,"£20,000.00",31/03/2017,Cambridgeshire County Co-Ordinating Committee,KIRLY LIMITED,False,False,Company,...,29/03/2017,29/04/2017,,Q1 2017,False,False,90,1849.0,77470,
4,C0317636,Liberal Democrats,Political Party,"£3,050.00",31/03/2017,West Dorset,Dorset CC Lib Dem Council Group,False,False,Unincorporated Association,...,10/03/2017,29/04/2017,,Q1 2017,False,False,90,2376.0,50614,
5,C0317650,Liberal Democrats,Political Party,"£5,000.00",31/03/2017,South Gloucestershire,Mr Kenneth Douglas,False,False,Individual,...,15/03/2017,29/04/2017,,Q1 2017,False,False,90,5141.0,37458,
6,C0317631,Liberal Democrats,Political Party,"£1,500.00",31/03/2017,Mid Dorset and North Poole,Broadstone Lib Hall Ctte,False,False,Unincorporated Association,...,29/03/2017,29/04/2017,,Q1 2017,False,True,90,2116.0,34484,
7,C0317674,Liberal Democrats,Political Party,"£3,000.00",31/03/2017,Twickenham and Richmond,Richmond Upon Thames Lib Dem Council Group,False,False,Unincorporated Association,...,01/03/2017,29/04/2017,,Q1 2017,False,True,90,2346.0,35386,
8,NC0317706,Liberal Democrats,Political Party,"£3,225.00",31/03/2017,Colchester,Magdalen Hall Company Limited,False,False,Company,...,28/03/2017,29/04/2017,,Q1 2017,False,True,90,1892.0,35428,
9,C0317629,Liberal Democrats,Political Party,"£15,945.07",31/03/2017,Central Party,Mr Anthony Dunn,False,False,Individual,...,29/03/2017,29/04/2017,,Q1 2017,False,False,90,,75232,


Next we need to know what data types we're dealing with for each column in our dataframe

`df.dtypes`

In [168]:
df.dtypes

﻿ECRef                            object
RegulatedEntityName               object
RegulatedEntityType               object
Value                             object
AcceptedDate                      object
AccountingUnitName                object
DonorName                         object
AccountingUnitsAsCentralParty       bool
IsSponsorship                       bool
DonorStatus                       object
RegulatedDoneeType                object
CompanyRegistrationNumber         object
Postcode                          object
DonationType                      object
NatureOfDonation                  object
PurposeOfVisit                   float64
DonationAction                    object
ReceivedDate                      object
ReportedDate                      object
IsReportedPrePoll                 object
ReportingPeriodName               object
IsBequest                           bool
IsAggregation                       bool
RegulatedEntityId                  int64
AccountingUnitId

We use .shape to find the dimensions of our data

`df.shape`

In [169]:
df.shape

(9151, 27)
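
If you don't have the workshop file to hand, you can try the same inspection steps on a small CSV held in memory. This is only a sketch: the three rows below are made up, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# A made-up three-row CSV standing in for results.csv.
csv_text = """Name,Value,AcceptedDate
Ann,"£1,000.00",01/04/2017
Bob,£250.50,31/03/2017
Cat,£75.00,15/02/2017
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # first two rows
print(sample.dtypes)   # one data type per column
print(sample.shape)    # (3, 3)
```

Notice that the money and date columns arrive as text, which is exactly the kind of problem we tackle next.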

## Data Problems: 
* Dates are stored as strings (dtype `object`), not as datetime values
* Values are stored as strings too, not ints or floats, so we can't perform any calculations on them
* We need a year column, perhaps even a month column
* There may be leading or trailing spaces in our data

Before we change anything we're going to create a copy of our dataframe and clean that up

`df2 = df.copy()`

In [170]:
df2 = df.copy()
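
Why bother with a copy? A quick sketch with a made-up two-row dataframe shows that changes to the copy leave the original untouched, so we can always go back and compare.

```python
import pandas as pd

# Made-up data for illustration.
original = pd.DataFrame({'Value': ['£10', '£20']})

# .copy() makes an independent dataframe.
working = original.copy()
working['Value'] = working['Value'].str.replace('£', '', regex=False)

print(original['Value'].tolist())  # ['£10', '£20']
print(working['Value'].tolist())   # ['10', '20']
```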

## Cleaning strings

We need to clean up the Value column and convert it to a number so we can do calculations on it. We start by stripping out the pound sign and storing the result in a new column. How do we check that it's worked?

`df2['col'] = df2['col'].str.replace('£', '')`

In [171]:
df2['Value_clean'] = df2['Value'].str.replace('£', '')

In [172]:
df2.head()

Unnamed: 0,﻿ECRef,RegulatedEntityName,RegulatedEntityType,Value,AcceptedDate,AccountingUnitName,DonorName,AccountingUnitsAsCentralParty,IsSponsorship,DonorStatus,...,ReportedDate,IsReportedPrePoll,ReportingPeriodName,IsBequest,IsAggregation,RegulatedEntityId,AccountingUnitId,DonorId,CampaigningName,Value_clean
0,C0317788,The Rt Hon David Mundell MP,Regulated Donee,"£2,500.00",01/04/2017,,Kirklee Property Company (2) Ltd,False,False,Company,...,07/04/2017,,April 2017,False,False,1491,,74251,,2500.0
1,NC0317561,UK Independence Party (UKIP),Political Party,"£2,645.00",31/03/2017,Mansfield,Arromax Structures Ltd,False,False,Company,...,25/04/2017,,Q1 2017,False,False,85,4928.0,68148,,2645.0
2,C0317662,Liberal Democrats,Political Party,"£2,858.20",31/03/2017,"Rugby, Nuneaton and North Warwickshire",Rugby Lib Dem Council Group,False,False,Unincorporated Association,...,29/04/2017,,Q1 2017,False,True,90,5276.0,43044,,2858.2
3,C0317679,Liberal Democrats,Political Party,"£20,000.00",31/03/2017,Cambridgeshire County Co-Ordinating Committee,KIRLY LIMITED,False,False,Company,...,29/04/2017,,Q1 2017,False,False,90,1849.0,77470,,20000.0
4,C0317636,Liberal Democrats,Political Party,"£3,050.00",31/03/2017,West Dorset,Dorset CC Lib Dem Council Group,False,False,Unincorporated Association,...,29/04/2017,,Q1 2017,False,False,90,2376.0,50614,,3050.0


In [173]:
df2['Value_clean'] = df2['Value_clean'].str.replace(',', '')
df2['Value_clean'] = df2['Value_clean'].str.replace('.', '')

In [174]:
df2.head()

Unnamed: 0,﻿ECRef,RegulatedEntityName,RegulatedEntityType,Value,AcceptedDate,AccountingUnitName,DonorName,AccountingUnitsAsCentralParty,IsSponsorship,DonorStatus,...,ReportedDate,IsReportedPrePoll,ReportingPeriodName,IsBequest,IsAggregation,RegulatedEntityId,AccountingUnitId,DonorId,CampaigningName,Value_clean
0,C0317788,The Rt Hon David Mundell MP,Regulated Donee,"£2,500.00",01/04/2017,,Kirklee Property Company (2) Ltd,False,False,Company,...,07/04/2017,,April 2017,False,False,1491,,74251,,250000
1,NC0317561,UK Independence Party (UKIP),Political Party,"£2,645.00",31/03/2017,Mansfield,Arromax Structures Ltd,False,False,Company,...,25/04/2017,,Q1 2017,False,False,85,4928.0,68148,,264500
2,C0317662,Liberal Democrats,Political Party,"£2,858.20",31/03/2017,"Rugby, Nuneaton and North Warwickshire",Rugby Lib Dem Council Group,False,False,Unincorporated Association,...,29/04/2017,,Q1 2017,False,True,90,5276.0,43044,,285820
3,C0317679,Liberal Democrats,Political Party,"£20,000.00",31/03/2017,Cambridgeshire County Co-Ordinating Committee,KIRLY LIMITED,False,False,Company,...,29/04/2017,,Q1 2017,False,False,90,1849.0,77470,,2000000
4,C0317636,Liberal Democrats,Political Party,"£3,050.00",31/03/2017,West Dorset,Dorset CC Lib Dem Council Group,False,False,Unincorporated Association,...,29/04/2017,,Q1 2017,False,False,90,2376.0,50614,,305000


Now check to see if that worked

In [175]:
df2.dtypes

﻿ECRef                            object
RegulatedEntityName               object
RegulatedEntityType               object
Value                             object
AcceptedDate                      object
AccountingUnitName                object
DonorName                         object
AccountingUnitsAsCentralParty       bool
IsSponsorship                       bool
DonorStatus                       object
RegulatedDoneeType                object
CompanyRegistrationNumber         object
Postcode                          object
DonationType                      object
NatureOfDonation                  object
PurposeOfVisit                   float64
DonationAction                    object
ReceivedDate                      object
ReportedDate                      object
IsReportedPrePoll                 object
ReportingPeriodName               object
IsBequest                           bool
IsAggregation                       bool
RegulatedEntityId                  int64
AccountingUnitId

## Changing data types

Ok, no luck. Removing characters doesn't change a column's data type, so we need to explicitly change the data type of the new `Value_clean` column.

`df2['Value_clean'] = pd.to_numeric(df2['Value_clean'])`

In [176]:
df2['Value_clean'] = pd.to_numeric(df2['Value_clean'])
df2.dtypes

﻿ECRef                            object
RegulatedEntityName               object
RegulatedEntityType               object
Value                             object
AcceptedDate                      object
AccountingUnitName                object
DonorName                         object
AccountingUnitsAsCentralParty       bool
IsSponsorship                       bool
DonorStatus                       object
RegulatedDoneeType                object
CompanyRegistrationNumber         object
Postcode                          object
DonationType                      object
NatureOfDonation                  object
PurposeOfVisit                   float64
DonationAction                    object
ReceivedDate                      object
ReportedDate                      object
IsReportedPrePoll                 object
ReportingPeriodName               object
IsBequest                           bool
IsAggregation                       bool
RegulatedEntityId                  int64
AccountingUnitId
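
Real data often contains a few entries that aren't numbers at all. As a hedged sketch (the 'unknown' entry below is made up, and nothing here says the workshop file contains one), `pd.to_numeric` can be told to turn anything it can't parse into a missing value instead of stopping with an error:

```python
import pandas as pd

# Made-up values; 'unknown' stands in for a bad entry.
raw = pd.Series(['250000', '264500', 'unknown'])

# errors='coerce' turns anything unparseable into NaN instead of raising.
numbers = pd.to_numeric(raw, errors='coerce')

print(numbers.dtype)
print(numbers.isna().sum())  # 1
```

Counting the missing values afterwards tells you how many rows need a closer look.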

However, because we removed the decimal point, the numbers are now in pence rather than pounds. To make sure we count our donation amounts correctly, let's divide the column by 100.


In [177]:
df2['Value_clean'] = df2['Value_clean']/100
df2.head()

Unnamed: 0,﻿ECRef,RegulatedEntityName,RegulatedEntityType,Value,AcceptedDate,AccountingUnitName,DonorName,AccountingUnitsAsCentralParty,IsSponsorship,DonorStatus,...,ReportedDate,IsReportedPrePoll,ReportingPeriodName,IsBequest,IsAggregation,RegulatedEntityId,AccountingUnitId,DonorId,CampaigningName,Value_clean
0,C0317788,The Rt Hon David Mundell MP,Regulated Donee,"£2,500.00",01/04/2017,,Kirklee Property Company (2) Ltd,False,False,Company,...,07/04/2017,,April 2017,False,False,1491,,74251,,2500.0
1,NC0317561,UK Independence Party (UKIP),Political Party,"£2,645.00",31/03/2017,Mansfield,Arromax Structures Ltd,False,False,Company,...,25/04/2017,,Q1 2017,False,False,85,4928.0,68148,,2645.0
2,C0317662,Liberal Democrats,Political Party,"£2,858.20",31/03/2017,"Rugby, Nuneaton and North Warwickshire",Rugby Lib Dem Council Group,False,False,Unincorporated Association,...,29/04/2017,,Q1 2017,False,True,90,5276.0,43044,,2858.2
3,C0317679,Liberal Democrats,Political Party,"£20,000.00",31/03/2017,Cambridgeshire County Co-Ordinating Committee,KIRLY LIMITED,False,False,Company,...,29/04/2017,,Q1 2017,False,False,90,1849.0,77470,,20000.0
4,C0317636,Liberal Democrats,Political Party,"£3,050.00",31/03/2017,West Dorset,Dorset CC Lib Dem Council Group,False,False,Unincorporated Association,...,29/04/2017,,Q1 2017,False,False,90,2376.0,50614,,3050.0
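
Removing the decimal point and dividing by 100 only works if every value has exactly two decimal places. An alternative sketch, which is not the route taken in this workshop, keeps the decimal point and lets `pd.to_numeric` read the pounds directly. The sample values are copied from the rows above; `regex=False` asks Pandas to treat the pattern as plain text and is available in reasonably recent versions of Pandas.

```python
import pandas as pd

values = pd.Series(['£2,500.00', '£2,858.20', '£20,000.00'])

# Strip the currency symbol and thousands separators, keep the decimal point.
cleaned = (
    values.str.replace('£', '', regex=False)
          .str.replace(',', '', regex=False)
)

# With the decimal point intact, to_numeric yields pounds directly.
pounds = pd.to_numeric(cleaned)

print(pounds.tolist())  # [2500.0, 2858.2, 20000.0]
```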


## Dropping and re-naming columns

Let's clean up our dataframe a bit by dropping the original Value column. The `axis=1` tells Pandas to look for the label 'Value' among the columns; `axis=0` would look among the row labels instead.

`df2 = df2.drop('Value', axis=1)`

In [178]:
df2 = df2.drop('Value', axis=1)
df2.head()

Unnamed: 0,﻿ECRef,RegulatedEntityName,RegulatedEntityType,AcceptedDate,AccountingUnitName,DonorName,AccountingUnitsAsCentralParty,IsSponsorship,DonorStatus,RegulatedDoneeType,...,ReportedDate,IsReportedPrePoll,ReportingPeriodName,IsBequest,IsAggregation,RegulatedEntityId,AccountingUnitId,DonorId,CampaigningName,Value_clean
0,C0317788,The Rt Hon David Mundell MP,Regulated Donee,01/04/2017,,Kirklee Property Company (2) Ltd,False,False,Company,MP - Member of Parliament,...,07/04/2017,,April 2017,False,False,1491,,74251,,2500.0
1,NC0317561,UK Independence Party (UKIP),Political Party,31/03/2017,Mansfield,Arromax Structures Ltd,False,False,Company,,...,25/04/2017,,Q1 2017,False,False,85,4928.0,68148,,2645.0
2,C0317662,Liberal Democrats,Political Party,31/03/2017,"Rugby, Nuneaton and North Warwickshire",Rugby Lib Dem Council Group,False,False,Unincorporated Association,,...,29/04/2017,,Q1 2017,False,True,90,5276.0,43044,,2858.2
3,C0317679,Liberal Democrats,Political Party,31/03/2017,Cambridgeshire County Co-Ordinating Committee,KIRLY LIMITED,False,False,Company,,...,29/04/2017,,Q1 2017,False,False,90,1849.0,77470,,20000.0
4,C0317636,Liberal Democrats,Political Party,31/03/2017,West Dorset,Dorset CC Lib Dem Council Group,False,False,Unincorporated Association,,...,29/04/2017,,Q1 2017,False,False,90,2376.0,50614,,3050.0


Now that's gone, let's rename the `Value_clean` column to Value

`df2 = df2.rename(columns={'old_name': 'new_name'})`

In [179]:
df2 = df2.rename(columns={'Value_clean': 'Value'})
df2.head()

Unnamed: 0,﻿ECRef,RegulatedEntityName,RegulatedEntityType,AcceptedDate,AccountingUnitName,DonorName,AccountingUnitsAsCentralParty,IsSponsorship,DonorStatus,RegulatedDoneeType,...,ReportedDate,IsReportedPrePoll,ReportingPeriodName,IsBequest,IsAggregation,RegulatedEntityId,AccountingUnitId,DonorId,CampaigningName,Value
0,C0317788,The Rt Hon David Mundell MP,Regulated Donee,01/04/2017,,Kirklee Property Company (2) Ltd,False,False,Company,MP - Member of Parliament,...,07/04/2017,,April 2017,False,False,1491,,74251,,2500.0
1,NC0317561,UK Independence Party (UKIP),Political Party,31/03/2017,Mansfield,Arromax Structures Ltd,False,False,Company,,...,25/04/2017,,Q1 2017,False,False,85,4928.0,68148,,2645.0
2,C0317662,Liberal Democrats,Political Party,31/03/2017,"Rugby, Nuneaton and North Warwickshire",Rugby Lib Dem Council Group,False,False,Unincorporated Association,,...,29/04/2017,,Q1 2017,False,True,90,5276.0,43044,,2858.2
3,C0317679,Liberal Democrats,Political Party,31/03/2017,Cambridgeshire County Co-Ordinating Committee,KIRLY LIMITED,False,False,Company,,...,29/04/2017,,Q1 2017,False,False,90,1849.0,77470,,20000.0
4,C0317636,Liberal Democrats,Political Party,31/03/2017,West Dorset,Dorset CC Lib Dem Council Group,False,False,Unincorporated Association,,...,29/04/2017,,Q1 2017,False,False,90,2376.0,50614,,3050.0
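
Here is the drop-then-rename pattern again on a made-up two-column dataframe, small enough to see the whole result at once:

```python
import pandas as pd

# Made-up data mirroring the two columns we care about.
toy = pd.DataFrame({
    'Value': ['£10.00', '£20.00'],
    'Value_clean': [10.0, 20.0],
})

# axis=1 means "look for this label among the columns" (axis=0 would mean rows).
toy = toy.drop('Value', axis=1)

# rename takes a dictionary mapping old names to new names.
toy = toy.rename(columns={'Value_clean': 'Value'})

print(list(toy.columns))       # ['Value']
print(toy['Value'].tolist())   # [10.0, 20.0]
```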


Let's make sure there aren't any trailing or leading spaces in the column names; if there are, they could cause havoc

`df2.columns`

In [180]:
df2.columns

Index([u'﻿ECRef', u'RegulatedEntityName', u'RegulatedEntityType',
       u'AcceptedDate', u'AccountingUnitName', u'DonorName',
       u'AccountingUnitsAsCentralParty', u'IsSponsorship', u'DonorStatus',
       u'RegulatedDoneeType', u'CompanyRegistrationNumber', u'Postcode',
       u'DonationType', u'NatureOfDonation', u'PurposeOfVisit',
       u'DonationAction', u'ReceivedDate', u'ReportedDate',
       u'IsReportedPrePoll', u'ReportingPeriodName', u'IsBequest',
       u'IsAggregation', u'RegulatedEntityId', u'AccountingUnitId', u'DonorId',
       u'CampaigningName', u'Value'],
      dtype='object')

All good, but maybe there are some in the donor names.

`df2['column'].unique()`

In [181]:
df2['DonorName'].unique()

array(['Kirklee Property Company (2) Ltd', 'Arromax Structures Ltd',
       'Rugby Lib Dem Council Group', ..., ' Anthony Clarke',
       'Professor Peter F Saville', 'Mrs Jelena Guadagnini'], dtype=object)

Yup, just like I thought: there is a leading space in ' Anthony Clarke', and we need to fix that. First, though, we need to make sure that DonorName is a column of strings. If there are any non-string values in there, such as numbers, `str.strip` will raise an error and we won't be able to manipulate the column

`df2['column'] = df2['column'].astype(str)`

In [182]:
df2['DonorName'] = df2['DonorName'].astype(str)


Ok, now we're going to strip out any leading or trailing spaces

`df2['column'] = df2['column'].map(str.strip)`

In [183]:
df2['DonorName_clean'] = df2['DonorName'].map(str.strip)

Did that work?

`df2['column'].unique()`

In [184]:
df2['DonorName_clean'].unique()

array(['Kirklee Property Company (2) Ltd', 'Arromax Structures Ltd',
       'Rugby Lib Dem Council Group', ..., 'Anthony Clarke',
       'Professor Peter F Saville', 'Mrs Jelena Guadagnini'], dtype=object)
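
Pandas also has its own string methods under `.str`, which skip over missing values rather than tripping on them. This is an alternative sketch, not the method used above, and the names and the missing entry are just illustrative:

```python
import pandas as pd

# Two names with stray spaces and one missing value.
names = pd.Series([' Anthony Clarke', 'Mr Kenneth Douglas ', None])

# .str.strip() removes leading and trailing whitespace and leaves missing values missing.
stripped = names.str.strip()

print(stripped.tolist()[:2])     # ['Anthony Clarke', 'Mr Kenneth Douglas']
print(stripped.isna().tolist())  # [False, False, True]
```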

## Dates and Years

Ok, the reporting periods are pretty messy, so let's create a new column holding just the year. We can extract that from one of the date columns. To work with dates we import a new library called datetime, which is part of Python's standard library

`import datetime`

In [185]:
import datetime

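Now we're going to extract the year from the accepted date and insert the value in a new column called 'YEAR'. Our dates are written day first (DD/MM/YYYY), so we tell Pandas that with `dayfirst=True`

`df2['YEAR'] = pd.to_datetime(df2['AcceptedDate'], dayfirst=True).dt.year`

In [186]:
df2['YEAR'] = pd.to_datetime(df2['AcceptedDate'], dayfirst=True).dt.year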

Did that work? If so, let's do the same for the month value, using the same approach as for the year

In [187]:
df2.head()

Unnamed: 0,﻿ECRef,RegulatedEntityName,RegulatedEntityType,AcceptedDate,AccountingUnitName,DonorName,AccountingUnitsAsCentralParty,IsSponsorship,DonorStatus,RegulatedDoneeType,...,ReportingPeriodName,IsBequest,IsAggregation,RegulatedEntityId,AccountingUnitId,DonorId,CampaigningName,Value,DonorName_clean,YEAR
0,C0317788,The Rt Hon David Mundell MP,Regulated Donee,01/04/2017,,Kirklee Property Company (2) Ltd,False,False,Company,MP - Member of Parliament,...,April 2017,False,False,1491,,74251,,2500.0,Kirklee Property Company (2) Ltd,2017
1,NC0317561,UK Independence Party (UKIP),Political Party,31/03/2017,Mansfield,Arromax Structures Ltd,False,False,Company,,...,Q1 2017,False,False,85,4928.0,68148,,2645.0,Arromax Structures Ltd,2017
2,C0317662,Liberal Democrats,Political Party,31/03/2017,"Rugby, Nuneaton and North Warwickshire",Rugby Lib Dem Council Group,False,False,Unincorporated Association,,...,Q1 2017,False,True,90,5276.0,43044,,2858.2,Rugby Lib Dem Council Group,2017
3,C0317679,Liberal Democrats,Political Party,31/03/2017,Cambridgeshire County Co-Ordinating Committee,KIRLY LIMITED,False,False,Company,,...,Q1 2017,False,False,90,1849.0,77470,,20000.0,KIRLY LIMITED,2017
4,C0317636,Liberal Democrats,Political Party,31/03/2017,West Dorset,Dorset CC Lib Dem Council Group,False,False,Unincorporated Association,,...,Q1 2017,False,False,90,2376.0,50614,,3050.0,Dorset CC Lib Dem Council Group,2017


In [188]:
# dayfirst=True: the dates are DD/MM/YYYY, so 01/04/2017 is 1 April, not 4 January
df2['MONTH'] = pd.to_datetime(df2['AcceptedDate'], dayfirst=True).dt.month

In [189]:
df2.head()

Unnamed: 0,﻿ECRef,RegulatedEntityName,RegulatedEntityType,AcceptedDate,AccountingUnitName,DonorName,AccountingUnitsAsCentralParty,IsSponsorship,DonorStatus,RegulatedDoneeType,...,IsBequest,IsAggregation,RegulatedEntityId,AccountingUnitId,DonorId,CampaigningName,Value,DonorName_clean,YEAR,MONTH
0,C0317788,The Rt Hon David Mundell MP,Regulated Donee,01/04/2017,,Kirklee Property Company (2) Ltd,False,False,Company,MP - Member of Parliament,...,False,False,1491,,74251,,2500.0,Kirklee Property Company (2) Ltd,2017,4
1,NC0317561,UK Independence Party (UKIP),Political Party,31/03/2017,Mansfield,Arromax Structures Ltd,False,False,Company,,...,False,False,85,4928.0,68148,,2645.0,Arromax Structures Ltd,2017,3
2,C0317662,Liberal Democrats,Political Party,31/03/2017,"Rugby, Nuneaton and North Warwickshire",Rugby Lib Dem Council Group,False,False,Unincorporated Association,,...,False,True,90,5276.0,43044,,2858.2,Rugby Lib Dem Council Group,2017,3
3,C0317679,Liberal Democrats,Political Party,31/03/2017,Cambridgeshire County Co-Ordinating Committee,KIRLY LIMITED,False,False,Company,,...,False,False,90,1849.0,77470,,20000.0,KIRLY LIMITED,2017,3
4,C0317636,Liberal Democrats,Political Party,31/03/2017,West Dorset,Dorset CC Lib Dem Council Group,False,False,Unincorporated Association,,...,False,False,90,2376.0,50614,,3050.0,Dorset CC Lib Dem Council Group,2017,3
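
Day-first dates are easy to get wrong, because a date like 01/04/2017 is also a valid month-first date (4 January). One way to remove the guesswork is to spell out the layout with `format`. A small sketch using two of the dates from the rows above:

```python
import pandas as pd

dates = pd.Series(['01/04/2017', '31/03/2017'])

# State the layout explicitly: day/month/four-digit year.
parsed = pd.to_datetime(dates, format='%d/%m/%Y')

print(parsed.dt.year.tolist())   # [2017, 2017]
print(parsed.dt.month.tolist())  # [4, 3]
```

It's worth spot-checking a date whose day is 12 or lower, since those are the ones that can silently flip.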


## Shrinking dataframes

Ok, now let's get rid of some of the columns to make our dataframe a more manageable size

To do this we need a list of column names again


In [190]:
df2.columns

Index([u'﻿ECRef', u'RegulatedEntityName', u'RegulatedEntityType',
       u'AcceptedDate', u'AccountingUnitName', u'DonorName',
       u'AccountingUnitsAsCentralParty', u'IsSponsorship', u'DonorStatus',
       u'RegulatedDoneeType', u'CompanyRegistrationNumber', u'Postcode',
       u'DonationType', u'NatureOfDonation', u'PurposeOfVisit',
       u'DonationAction', u'ReceivedDate', u'ReportedDate',
       u'IsReportedPrePoll', u'ReportingPeriodName', u'IsBequest',
       u'IsAggregation', u'RegulatedEntityId', u'AccountingUnitId', u'DonorId',
       u'CampaigningName', u'Value', u'DonorName_clean', u'YEAR', u'MONTH'],
      dtype='object')

Alright, let's figure out what we need and shrink the dataframe

`df2 = df2[['RegulatedEntityName', 'AcceptedDate', 'DonorName_clean', 'DonorStatus', 'YEAR', 'Value', 'RegulatedEntityType', 'DonorId', 'CampaigningName']]`

In [191]:
df2 = df2[['RegulatedEntityName', 'AcceptedDate', 'DonorName_clean', 'DonorStatus', 'YEAR', 'Value', 'RegulatedEntityType', 'DonorId', 'CampaigningName']]


In [192]:
df2.head()

Unnamed: 0,RegulatedEntityName,AcceptedDate,DonorName_clean,DonorStatus,YEAR,Value,RegulatedEntityType,DonorId,CampaigningName
0,The Rt Hon David Mundell MP,01/04/2017,Kirklee Property Company (2) Ltd,Company,2017,2500.0,Regulated Donee,74251,
1,UK Independence Party (UKIP),31/03/2017,Arromax Structures Ltd,Company,2017,2645.0,Political Party,68148,
2,Liberal Democrats,31/03/2017,Rugby Lib Dem Council Group,Unincorporated Association,2017,2858.2,Political Party,43044,
3,Liberal Democrats,31/03/2017,KIRLY LIMITED,Company,2017,20000.0,Political Party,77470,
4,Liberal Democrats,31/03/2017,Dorset CC Lib Dem Council Group,Unincorporated Association,2017,3050.0,Political Party,50614,
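
The double square brackets can look odd at first: the outer pair does the selecting, the inner pair is just a Python list of column names. A sketch on made-up data, with the list pulled out into a variable called `keep` (a name chosen here for illustration):

```python
import pandas as pd

# Made-up data with one column we don't need.
wide = pd.DataFrame({
    'DonorName_clean': ['Ann', 'Bob'],
    'Postcode': ['AB1 2CD', 'EF3 4GH'],
    'YEAR': [2017, 2017],
    'Value': [100.0, 250.5],
})

# The columns to keep, in the order we want them.
keep = ['DonorName_clean', 'YEAR', 'Value']

# Selecting with a list returns a dataframe; .copy() makes it independent of `wide`.
narrow = wide[keep].copy()

print(list(narrow.columns))  # ['DonorName_clean', 'YEAR', 'Value']
print(narrow.shape)          # (2, 3)
```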


## Saving our data

Ok, finally let's save our clean data for the next class

`df2.to_csv('clean_data.csv', encoding='utf8')`

In [194]:
df2.to_csv('clean_data.csv', encoding='utf8')
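
One thing to be aware of: by default `to_csv` also writes the row index as an extra, unnamed first column. If you don't want that, pass `index=False`. The sketch below uses made-up data and writes to in-memory buffers rather than real files, just to show the difference when the data is read back in.

```python
import io
import pandas as pd

# Made-up data for illustration.
clean = pd.DataFrame({'Donor': ['Ann', 'Bob'], 'Value': [100.0, 250.5]})

# Default: the row index is written as an unnamed first column.
with_index = clean.to_csv()

# index=False leaves the row index out of the file.
without_index = clean.to_csv(index=False)

cols_with = list(pd.read_csv(io.StringIO(with_index)).columns)
cols_without = list(pd.read_csv(io.StringIO(without_index)).columns)

print(cols_with)     # ['Unnamed: 0', 'Donor', 'Value']
print(cols_without)  # ['Donor', 'Value']
```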