# Data Cleaning with Pandas #

Welcome to your introduction to Pandas and Jupyter Notebook!

Today we're going to learn how to read in a CSV file, create a dataframe, identify the different ways it may be dirty, and learn some techniques for cleaning up our data set.

karrie.anne.kehoe@gmail.com/@karriekehoe

## Getting to grips with Jupyter Notebook and Pandas

Jupyter Notebook is an interactive, browser-based programming environment. It can be used with multiple programming languages, for writing documentation and for visualising data. If you want to learn more about what Jupyter Notebook can do, you can read its documentation at http://jupyter-notebook.readthedocs.io/en/latest/notebook.html

Pandas is a Python library designed for data analysis. It's very flexible, easy to use and has a range of useful built-in functions. If you want to learn more about what Pandas can do, you can read its documentation at http://pandas.pydata.org/ or browse the cookbook at http://pandas.pydata.org/pandas-docs/version/0.18.1/tutorials.html

### Shortcuts:
* `esc` - takes you into command mode
* `a` - insert cell above
* `b` - insert cell below
* `shift` + `tab` - shows you the documentation for your code
* `shift` + `enter` - runs your cell
* `d` `d` - deletes a cell (press `d` twice)

### Our Data

Today we're going to look at farm subsidies given to Irish companies and groups from 2013, courtesy of https://farmsubsidy.org

### Terminology ###

**Dataframe** - a dataframe is a two dimensional tabular data structure with labeled axes

**Series** - a series is similar to a list, array or a single column within a dataframe

## Starting off

First we need to import Pandas, our Python library. We use 'pd' as an alias to make it easier when typing in our code.
To do so, we are going to type

`import pandas as pd`

Now we create a dataframe by reading in our CSV. Replace 'filepath' with the path to your file.

`df = pd.read_csv('filepath')`

Let's look at the first ten rows of our data

`df.head(10)`

Next we need to know what data types we're dealing with for each column in our dataframe

`df.dtypes`

We use .shape to find the dimensions of our data

`df.shape`
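
If you'd like to try these inspection steps without the real file, here is a minimal sketch. The CSV text, its column names and its values are all invented for illustration and are not the farm subsidy data.

```python
import io
import pandas as pd

# Invented sample data standing in for the real CSV file.
csv_text = """Name,Location,Total amount,Year
Farm A,"Cork, Ireland","€1,200.50",2013
Farm B,"Galway, Ireland","€350.00",2013
Farm C,"Dublin, Ireland","€9,800.75",2013
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows
print(sample.dtypes)    # text columns are not reported as numbers
print(sample.shape)     # (number of rows, number of columns)
```

Notice that the amount column is read as text because of the euro sign and the commas, which is exactly the kind of problem we want to spot at this stage.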

Let's get a list of our column headers to make sure there aren't any problems with them
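
One way to do this is with the `.columns` attribute. The sketch below uses a small made-up dataframe (the header names are assumptions) so you can see how stray spaces show up.

```python
import pandas as pd

# Hypothetical frame; note the stray spaces in two of the headers.
sample = pd.DataFrame(columns=["Name", " Location", "Total amount ", "Year"])

# Converting to a list prints each label in quotes,
# which makes leading or trailing spaces easy to spot.
headers = list(sample.columns)
print(headers)
```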

## Data Problems: 
* Location has commas in the name
* Total amount values are stored as Python objects (text) and not ints or floats, so we can't perform any calculations on them
* Something looks odd in the year column
* The country column seems to be dirty

Before we change anything we're going to create a copy of our dataframe and clean that up

`df2 = df.copy()`
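
Why bother with a copy? Here is a small sketch, using invented values and hypothetical variable names, showing that changes made to a copy leave the original untouched.

```python
import pandas as pd

# Invented values for illustration only.
original = pd.DataFrame({"Total amount": ["€1,200", "€350"]})

# .copy() returns an independent dataframe.
working = original.copy()

# Change the copy, then look at the original.
working.loc[0, "Total amount"] = "changed"
print(original.loc[0, "Total amount"])  # the original keeps its value
```

This means we can always go back to the raw data if a cleaning step goes wrong.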

## Cleaning strings

We need to clean up the Total amount column and convert it to a number so we can do calculations with it. First we strip out the euro sign. How do we check that it's worked?

`df2['col'] = df2['col'].str.replace('€', '')`

Next thing is to remove those pesky commas from our data

Now check to see if that worked
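
Here is one possible way to do both steps and then check the result. The values are invented, and `regex=False` is used so that the characters are treated as plain text rather than as a pattern.

```python
import pandas as pd

# Invented amounts, formatted like the dirty column.
sample = pd.DataFrame({"Total amount": ["€1,200.50", "€350.00", "€9,800.75"]})

# Remove the euro sign, then the thousands separators.
cleaned = sample["Total amount"].str.replace("€", "", regex=False)
cleaned = cleaned.str.replace(",", "", regex=False)
sample["Value_clean"] = cleaned

# Check: count the values that still contain either character.
leftover = sample["Value_clean"].str.contains("[€,]", regex=True).sum()
print(sample)
print("values still dirty:", leftover)
```

The cleaned values are still text at this point, they just no longer contain anything that would stop them being converted to numbers.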

It looks like the Country column is dirty too, let's clean that

Let's use a string replacement to standardise our data

`df2['Col'] = df2['Col'].str.replace('x', '')`
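
As a rough sketch of what standardising can look like, the example below assumes the column mixes a country code with a full name and some stray spaces. Those particular spellings are invented, so check `.unique()` on your own data to see what actually needs fixing.

```python
import pandas as pd

# Invented spellings; the real column may be dirty in other ways.
sample = pd.DataFrame({"Country": ["IE", " IE", "Ireland", "IE "]})

print(sample["Country"].unique())  # every distinct spelling in the column

# Trim the whitespace, then map the long name onto the code.
country = sample["Country"].str.strip()
country = country.str.replace("Ireland", "IE", regex=False)
sample["Country"] = country

print(sample["Country"].unique())  # ideally one spelling per country now
```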

## Changing data types

Ok, no luck: removing the characters doesn't change the data type, so the column is still text. We need to explicitly change the data type for the new Value_clean column.

`df2['Value_clean'] = pd.to_numeric(df2['Value_clean'])`
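
To see the conversion in action, here is a small sketch with invented values. It also shows the `errors="coerce"` option of `pd.to_numeric`, which turns anything that can't be parsed into a missing value instead of raising an error.

```python
import pandas as pd

# Invented values that are already free of symbols and commas.
sample = pd.DataFrame({"Value_clean": ["1200.50", "350.00", "9800.75"]})

sample["Value_clean"] = pd.to_numeric(sample["Value_clean"])
print(sample["Value_clean"].dtype)  # now a numeric dtype
print(sample["Value_clean"].sum())  # calculations work

# With errors="coerce", unparseable entries become NaN.
messy = pd.to_numeric(pd.Series(["10", "n/a"]), errors="coerce")
print(messy)
```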

## Dropping and re-naming columns

Let's clean up our dataframe a bit by dropping the Type column

`df2 = df2.drop(columns="Type ")`

Now that's gone, let's rename the Clean column

`df2 = df2.rename(columns={'old_name': 'new_name'})`
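
Here are the two steps together on a tiny made-up dataframe. The values and the new name `Total` are just assumptions for illustration.

```python
import pandas as pd

# Invented rows; "Total" below is a hypothetical new column name.
sample = pd.DataFrame({
    "Name": ["Farm A", "Farm B"],
    "Type": ["x", "y"],
    "Value_clean": [1200.5, 350.0],
})

# Both methods return a new dataframe, so we assign the result back.
sample = sample.drop(columns="Type")
sample = sample.rename(columns={"Value_clean": "Total"})
print(list(sample.columns))
```

Note that the name passed to `drop` has to match the header exactly, including any spaces.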

Let's make sure there aren't any trailing or leading spaces in the column names. If there are, they could cause havoc, because a name like 'Type ' won't match 'Type'.

`df2.columns`
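
If you do find stray spaces, one way to remove them is to strip every header in one go. This is a sketch with made-up header names.

```python
import pandas as pd

# Hypothetical headers with leading and trailing spaces.
sample = pd.DataFrame(columns=["Name", " Location", "Type "])

# .str.strip() applies to every column label at once.
sample.columns = sample.columns.str.strip()
print(list(sample.columns))
```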

## Analysis

Now that we have a lovely clean data set, let's dig in a little and see which group got the most money

That's not very clear, let's sort the data
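
One possible way to answer this is to group by the recipient, add up the amounts and sort the totals. The column names `Name` and `Total` and the figures below are assumptions, not the real data.

```python
import pandas as pd

# Invented recipients and amounts.
sample = pd.DataFrame({
    "Name": ["Group A", "Group B", "Group A", "Group C"],
    "Total": [100.0, 250.0, 300.0, 50.0],
})

# Sum the payments for each recipient.
totals = sample.groupby("Name")["Total"].sum()

# Sort so that the largest total comes first.
ranked = totals.sort_values(ascending=False)
print(ranked)
```

Adding `.head(10)` to the end would give just the top ten recipients.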

## Saving our data

Ok, finally let's save our clean data for the next class

`df2.to_csv('clean_data.csv', encoding='utf8')`
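
To check that a save worked, you can read the file straight back in. The sketch below writes invented data to a temporary folder so it doesn't leave files behind, and it passes `index=False`, an optional extra that stops Pandas writing the row numbers as a column.

```python
import os
import tempfile
import pandas as pd

# Invented data standing in for the cleaned dataframe.
sample = pd.DataFrame({"Name": ["Farm A"], "Total": [1200.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean_data.csv")
    # index=False leaves the row numbers out of the file.
    sample.to_csv(path, encoding="utf8", index=False)
    reloaded = pd.read_csv(path)

print(reloaded)
```

Without `index=False`, the reloaded file would have an extra unnamed column holding the old row numbers.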