## What is Pandas?

pandas is a Python library containing a set of functions and specialised data structures that have been designed to help Python programmers to perform data analysis tasks in a structured way.

Most of what pandas does could be done in basic Python, but the collected set of pandas functions and data structures makes data analysis tasks more consistent in syntax and therefore aids readability.

Particular features of pandas that we will be looking at include:


* Reading csv files
* Selecting specific columns
* Selecting specific rows
* Providing simple summary statistics on numeric columns
* Aggregating across columns
* Creating new columns
* Simple plotting of data



## Importing the pandas library

Importing the pandas library is done in exactly the same way as for any other library. In almost all examples of Python code using the pandas library, it will have been imported and given an alias of 'pd'. Who are we to break with tradition?


In [None]:
import pandas as pd

## Pandas data structures

There are two main data structures used by pandas: the Series and the DataFrame. A Series is broadly equivalent to a vector or a list, and a DataFrame is equivalent to a table. Each column in a pandas DataFrame is a pandas Series.

We will mainly be looking at the DataFrame.


Dataframes can be created directly in code.

In [None]:
df1 = pd.DataFrame([[1, 2], [1, 4], [5, 6]],
                   columns=['A', 'B'])
df1
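
The relationship between the two structures can be checked directly. The following is a minimal sketch that rebuilds the same small table under a hypothetical name, `demo_df`, so that it runs on its own; it shows that selecting one column returns a Series, which can be treated much like a list.

```python
import pandas as pd

# Same small table as df1, rebuilt under a hypothetical name so this cell is self-contained
demo_df = pd.DataFrame([[1, 2], [1, 4], [5, 6]],
                       columns=['A', 'B'])

# Selecting a single column gives back a Series
col_a = demo_df['A']
print(type(col_a).__name__)

# A Series has a length and can be converted to an ordinary Python list
print(len(col_a))
print(col_a.tolist())
```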

It is however more usual to create a Pandas dataframe by reading a csv file.

## Reading a csv file

When we read a csv dataset in base Python we did so by opening the dataset, reading and processing one record at a time, and then closing the dataset after we had read the last record. Reading datasets in this way is slow and places all of the responsibility for extracting individual items of information from the records on the programmer.

The main advantage of this approach, however, is that you only have to store one dataset record in memory at a time. This means that, if you have the time, you can process datasets of arbitrarily large size.
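
As a reminder of what the record-at-a-time approach looks like, here is a minimal sketch using the standard `csv` module. The in-memory file and its values (`ElecOnly`, `Dual`) are invented for illustration and stand in for a real dataset on disk.

```python
import csv
import io

# Hypothetical in-memory file standing in for a csv dataset on disk
sample_file = io.StringIO("anonID,fuelTypes\n1,ElecOnly\n2,Dual\n3,Dual\n")

reader = csv.DictReader(sample_file)
dual_count = 0
for record in reader:
    # Only the current record is held in memory;
    # picking out the field we need is our responsibility
    if record["fuelTypes"] == "Dual":
        dual_count += 1

print(dual_count)
```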

In Pandas, csv files are read as complete datasets. You do not have to explicitly open and close the dataset. All of the dataset records are assembled into a dataframe. If your dataset has column headers in the first record then these can be used as the dataframe column names. You can explicitly state this in the parameters to the call, but Pandas is usually able to infer that there is a header row and use it automatically.

In [None]:
my_df = pd.read_csv("D:\\Intro_to_programming_09042018\\data\\geog.csv")
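
The header behaviour can be tried without a file on disk. This sketch assumes two small, invented csv strings read through `io.StringIO`: one with a header row, which `read_csv` picks up by default, and one without, where `header=None` and `names` supply the column names explicitly.

```python
import io
import pandas as pd

# Invented sample data: the same two records, with and without a header row
csv_with_header = "anonID,fuelTypes\n1,ElecOnly\n2,Dual\n"
csv_without_header = "1,ElecOnly\n2,Dual\n"

# Default behaviour: the first record is used for the column names
inferred = pd.read_csv(io.StringIO(csv_with_header))
print(list(inferred.columns))

# No header row: say so explicitly and provide the names ourselves
named = pd.read_csv(io.StringIO(csv_without_header),
                    header=None,
                    names=["anonID", "fuelTypes"])
print(list(named.columns))

# Both dataframes hold the same two data rows
print(len(inferred), len(named))
```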

We can get various information about the newly created dataframe:

In [None]:
# first 5 rows
my_df.head()

In [None]:
# number of rows and columns
my_df.shape

In [None]:
# rows only - another use of the len function
len(my_df)
    
    

In [None]:
# first and last few rows. If there were a lot of columns, you would only get the first and last few of those as well
my_df


In [None]:
# count the number of non-missing (not NA) values for each variable
my_df.count()

In [None]:
# number of unique values for a specific column
print(len(my_df['NUTS4'].unique()))

In [None]:
# List of columns
my_df.columns

In [None]:
# or alternatively as a proper list
list(my_df)

In [None]:
# You can iterate over the list to get information on the individual columns
for x in list(my_df) :
    print(len(my_df[x].unique()), "\t", x)

In [None]:
# values of specific column
print(my_df.ACORN_Category.unique())

# values of specific column
print(my_df.fuelTypes.unique())

# first of the unique values of a specific column
x = my_df.fuelTypes.unique()[0]

print(x)

In [None]:
# An alternative way of specifying the column name

y = my_df['ACORN_Category'].unique()
print(y)

In [None]:
# count of each value in a column, without using groupby

a = my_df.fuelTypes.value_counts()
print(a)

In [None]:
# for numeric columns, you can get basic statistics
# The warning is because there are a few missing values

my_df.describe()

In [None]:
# select rows based on column criteria

my_df[my_df['ACORN_Category'] == 0]

In [None]:
# select rows based on column criteria

my_df[my_df['anonID'] < 25]


In [None]:
# Rows containing missing values can be removed by using the dropna method

print(my_df.count())
print("\n\n")
my_df = my_df.dropna()
print(my_df.count())
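
To see the effect of `count` and `dropna` on a dataset small enough to check by eye, here is a sketch using an invented four-row dataframe, `small_df`. The column names are borrowed from the dataset above; the values are made up.

```python
import pandas as pd

# Invented data with one missing value in each of two columns
small_df = pd.DataFrame({
    "anonID": [1, 2, 3, 4],
    "Elec_Tout": [0.5, float("nan"), 0.7, 0.9],
    "Gas_Tout": [1.0, 1.2, float("nan"), 1.4],
})

# count() reports the non-missing values per column
print(small_df.count())

# dropna() returns a new dataframe without any row that has a missing value;
# small_df itself is unchanged unless the result is assigned back to it
cleaned = small_df.dropna()
print(len(small_df), len(cleaned))
```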

In [None]:
# now run the describe again
# for numeric columns, you can get basic statistics
# no warning because there are now no missing values

my_df.describe()

In [None]:
# more complex selections

my_df[my_df.ACORN_Category == 6]

#my_df[(my_df.anonID <= 10) & (my_df.ACORN_Group == 'M')]

#my_df[(my_df.anonID <= 10) & (my_df.ACORN_Group == 'M')][['anonID', 'fuelTypes']]
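
The commented-out selections combine two conditions with `&` and then pick a subset of columns. The sketch below runs the same pattern on an invented dataframe, `sel_df`, with made-up values. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Invented data using column names from the dataset above
sel_df = pd.DataFrame({
    "anonID": [3, 8, 12, 9],
    "ACORN_Group": ["M", "K", "M", "M"],
    "fuelTypes": ["Dual", "ElecOnly", "Dual", "ElecOnly"],
})

# Both conditions must hold for a row to be kept
mask = (sel_df.anonID <= 10) & (sel_df.ACORN_Group == "M")

# Filter the rows first, then keep only two of the columns
subset = sel_df[mask][["anonID", "fuelTypes"]]
print(subset)
```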

In [None]:
# values of specific column

my_df.ACORN_Group.unique()

In [None]:
print(my_df.groupby("ACORN_Group").size())

In [None]:
# group by a specific column; this returns a GroupBy object, which needs a method such as size() to produce counts
x = my_df.groupby("ACORN_Category")
x
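
A GroupBy object does nothing visible until an aggregation is applied to it. The following sketch, on an invented dataframe `grp_df`, applies `size()` and `mean()` to the groups and shows that `size()` gives the same counts as `value_counts()` on the grouping column.

```python
import pandas as pd

# Invented data: two categories with two and three rows respectively
grp_df = pd.DataFrame({
    "ACORN_Category": [1, 1, 2, 2, 2],
    "Elec_Tout": [0.5, 0.7, 0.2, 0.4, 0.6],
})

grouped = grp_df.groupby("ACORN_Category")

# Number of rows in each group
sizes = grouped.size()
print(sizes)

# Mean of a numeric column within each group
means = grouped["Elec_Tout"].mean()
print(means)

# The same counts, obtained without groupby
counts = grp_df["ACORN_Category"].value_counts()
print(counts)
```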

In [None]:
# creating a new column

my_df['Tout_Total']  = my_df.Elec_Tout + my_df.Gas_Tout
my_df

In [None]:
# deleting a column

del my_df['Tout_Total']

# or

#my_df.drop('Tout_Total', axis=1)  # returns a new dataframe; my_df itself is not changed

my_df
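
The two ways of removing a column behave differently: `del` changes the dataframe in place, whereas `drop` returns a new dataframe and leaves the original alone. This sketch uses an invented two-row dataframe, `col_df`, to show the difference.

```python
import pandas as pd

# Invented data using column names from the dataset above
col_df = pd.DataFrame({"Elec_Tout": [0.5, 0.7], "Gas_Tout": [1.0, 1.2]})
col_df["Tout_Total"] = col_df.Elec_Tout + col_df.Gas_Tout

# drop returns a new dataframe; col_df still has the column afterwards
without_total = col_df.drop(columns="Tout_Total")
print(list(col_df.columns))
print(list(without_total.columns))

# del removes the column from col_df itself
del col_df["Tout_Total"]
print(list(col_df.columns))
```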

In [None]:
# The simple plotting functions of Pandas make use of the Matplotlib package, so it has to be imported.

import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# Histogram of ACORN_Category

my_df['ACORN_Category'].plot.hist()

In [None]:
# Scatterplot

my_df.plot.scatter(x='ACORN_Category', y='ACORN_Type')