# PANDAS

**Pandas** is the Python library designed for working with DataFrames

In Python, DataFrames are two-dimensional labelled arrays: they have row and column labels, can hold heterogeneous types of data, and can contain missing values

**Install required libraries**

To work with Pandas we need NumPy installed first, because Pandas builds on functions included in that library

In [4]:
import numpy as np

In [5]:
import pandas as pd

As we saw in previous show and tell sessions, there are three fundamental Pandas data structures: **Series**, **DataFrame** and **Index**
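To make the three structures concrete, here is a minimal sketch. The variable names and the toy data (`temperatures`, `people`) are invented for illustration and are not part of the session material.

```python
import numpy as np
import pandas as pd

# A Series: one-dimensional values plus an index (hypothetical data)
temperatures = pd.Series([18.5, 21.0, np.nan], index=["Mon", "Tue", "Wed"])

# A DataFrame: columns of different types that share one row index
people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cleo"],      # text
        "age": [34, 29, 41],                 # integers
        "height_m": [1.68, np.nan, 1.75],    # floats with a missing value
    }
)

# An Index: the labels attached to the rows or the columns
print(type(temperatures.index))
print(people.columns)

# Each column keeps its own dtype, and missing values are counted per column
print(people.dtypes)
print(people.isna().sum())
```

The `np.nan` entries show how missing values sit alongside ordinary data in both a Series and a DataFrame.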

## 1. Pandas Series object

As with NumPy arrays, a Pandas Series is a one-dimensional array. In Pandas it also carries an index

In [28]:
Seriespd = pd.Series([0.2,0.8,0.9,1.0,1.8])

In [29]:
Seriespd

0    0.2
1    0.8
2    0.9
3    1.0
4    1.8
dtype: float64

A distinguishing feature of a Pandas Series is that it stores an index alongside its values. Both parts are exposed as attributes of the Series: `values` and `index`

The **values** of a Pandas Series are accessed using the **values** attribute

In [8]:
Seriespd.values

array([0.2, 0.8, 0.9, 1. , 1.8])

The **index** of a Pandas Series is accessed using the **index** attribute

In [9]:
Seriespd.index

RangeIndex(start=0, stop=5, step=1)

## 1.1 Pandas Series indexing

We use a notation similar to the one used in NumPy to access elements of a Pandas Series

In [10]:
Seriespd[1]

0.8

In [11]:
Seriespd[2:4]

2    0.9
3    1.0
dtype: float64

Remember that the right-hand bound of a slice is exclusive: `[2:4]` returns the elements at positions 2 and 3, one fewer than the stop value suggests

In [12]:
Seriespd[2:5]

2    0.9
3    1.0
4    1.8
dtype: float64

## 1.2 Pandas special indexing features

We can use different types of values in our indexes, not just numeric ones

In [19]:
Mydata = pd.Series([2,4,6,7,9,12],index=['a','b','c','d','e','f'])

In [20]:
Mydata

a     2
b     4
c     6
d     7
e     9
f    12
dtype: int64
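With a labelled index we can select by label as well as by position. The sketch below rebuilds the same Series so that it runs on its own; the result variable names are illustrative. One point worth noticing is that a slice by label includes its end label, unlike a slice by position.

```python
import pandas as pd

# Same Series as above, rebuilt so the block is self-contained
Mydata = pd.Series([2, 4, 6, 7, 9, 12], index=["a", "b", "c", "d", "e", "f"])

single = Mydata["c"]             # one element selected by its label
by_label = Mydata["b":"d"]       # label slice: the end label 'd' is included
by_position = Mydata.iloc[1:3]   # positional slice: position 3 is excluded

print(single)
print(by_label)
print(by_position)
```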

## 1.3 Building Pandas Series objects

In [30]:
Buildseries = pd.Series([2,8,9,10,11])

In [31]:
Buildseries

0     2
1     8
2     9
3    10
4    11
dtype: int64

In the example above, the Series is built from a Python list and, since no index is given, Pandas assigns the default integer index starting at 0
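A list is only one of the inputs that `pd.Series()` accepts. The following sketch, with illustrative variable names, shows a few other common ways to build a Series.

```python
import numpy as np
import pandas as pd

# From a list: default integer index 0..4
from_list = pd.Series([2, 8, 9, 10, 11])

# From a NumPy array with an explicit index
from_array = pd.Series(np.array([2, 8, 9]), index=["x", "y", "z"])

# From a dictionary: the keys become the index
from_dict = pd.Series({"a": 1.5, "b": 2.5})

# From a scalar: the value is repeated for every index label
from_scalar = pd.Series(7, index=[10, 20, 30])

print(from_array)
print(from_dict)
print(from_scalar)
```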

## 2. Pandas DataFrame object

The main difference between a Series and a DataFrame is that a DataFrame has both **row indices** and flexible **column names**, whereas a Series has only a single index

In [20]:
region_pop_list = {'South East':9180135,'London':8961989,'North West':7341196,'East of England':6236072}

Then we transform it into a Series using Pandas

In [21]:
region_pop = pd.Series(region_pop_list)

In [22]:
region_pop

South East         9180135
London             8961989
North West         7341196
East of England    6236072
dtype: int64

In [23]:
covid_cases_list ={'South East':45496,'London':53804,'North West':98565,'East of England':32760}

In [24]:
covid_cases = pd.Series(covid_cases_list)

In [25]:
covid_cases

South East         45496
London             53804
North West         98565
East of England    32760
dtype: int64

We can combine these two Series in a dictionary to build a single two-dimensional object that includes all this information

In [26]:
England_regions = pd.DataFrame({'population':region_pop,'new_covid_cases':covid_cases})

In [27]:
England_regions

| | population | new_covid_cases |
|---|---:|---:|
| South East | 9180135 | 45496 |
| London | 8961989 | 53804 |
| North West | 7341196 | 98565 |
| East of England | 6236072 | 32760 |


Now we can access the row labels and the column labels of the DataFrame using the `index` and `columns` attributes

In [29]:
England_regions.index

Index(['South East', 'London', 'North West', 'East of England'], dtype='object')

In [30]:
England_regions.columns

Index(['population', 'new_covid_cases'], dtype='object')

### Access individual columns using column names

In [33]:
England_regions['population']

South East         9180135
London             8961989
North West         7341196
East of England    6236072
Name: population, dtype: int64

In [34]:
England_regions['new_covid_cases']

South East         45496
London             53804
North West         98565
East of England    32760
Name: new_covid_cases, dtype: int64

### Data Selection in Series

Indexers: loc and iloc (the older ix indexer was removed in pandas 1.0)

In [36]:
newdata = pd.Series(['d','e','f'], index =[1,2,6])

In [37]:
newdata

1    d
2    e
6    f
dtype: object

In [38]:
newdata[1]

'd'

In the coming weeks we will make extensive use of the `iloc` indexer to slice DataFrames. It provides purely integer-location based indexing, that is, selection by position.

See the official Pandas documentation for details on how to use it: [DataFrame.iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)

In [40]:
newdata.iloc[0]

'd'

In [44]:
newdata.iloc[0:2]

1    d
2    e
dtype: object
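Because this Series has integer labels that do not match the positions, it is a good case for comparing `loc` (selection by label) with `iloc` (selection by position). This is a minimal sketch that rebuilds `newdata`; the result variable names are illustrative.

```python
import pandas as pd

newdata = pd.Series(["d", "e", "f"], index=[1, 2, 6])

by_label = newdata.loc[1]            # the element whose label is 1
by_position = newdata.iloc[1]        # the element in the second position

label_slice = newdata.loc[1:2]       # labels 1 through 2, end label included
position_slice = newdata.iloc[1:2]   # position 1 only, end position excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Spelling out `loc` or `iloc` avoids any ambiguity about whether an integer is meant as a label or as a position.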

### DataFrames as two dimensional arrays

We can view a DataFrame as a two-dimensional array, as shown below

In [46]:
England_regions.values

array([[9180135,   45496],
       [8961989,   53804],
       [7341196,   98565],
       [6236072,   32760]], dtype=int64)

This two-dimensional array can in turn be converted back into a DataFrame

DataFrames also support array-style manipulations, such as transposing with the `T` attribute

In [48]:
England_regions.T

| | South East | London | North West | East of England |
|---|---:|---:|---:|---:|
| population | 9180135 | 8961989 | 7341196 | 6236072 |
| new_covid_cases | 45496 | 53804 | 98565 | 32760 |
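As a sketch of the round trip between arrays and DataFrames, the block below starts from a plain NumPy array (two rows taken from the table above), attaches row and column labels to get a DataFrame, and then transposes it. The variable names `raw_values`, `rebuilt` and `transposed` are illustrative.

```python
import numpy as np
import pandas as pd

# A plain two-dimensional array without any labels
raw_values = np.array([[9180135, 45496],
                       [8961989, 53804]])

# Attach row and column labels to turn it into a DataFrame
rebuilt = pd.DataFrame(
    raw_values,
    index=["South East", "London"],
    columns=["population", "new_covid_cases"],
)

# Transposing swaps the row labels and the column labels
transposed = rebuilt.T

print(rebuilt)
print(transposed)
```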


We can access a single column

In [49]:
England_regions['population']

South East         9180135
London             8961989
North West         7341196
East of England    6236072
Name: population, dtype: int64

In [50]:
England_regions

| | population | new_covid_cases |
|---|---:|---:|
| South East | 9180135 | 45496 |
| London | 8961989 | 53804 |
| North West | 7341196 | 98565 |
| East of England | 6236072 | 32760 |


When we use the `iloc` indexer in Pandas, the result keeps the index and column labels of the original DataFrame

In [51]:
England_regions.iloc[:1,:1]

| | population |
|---|---:|
| South East | 9180135 |


Subset population information from dataframe for two regions 

In [52]:
England_regions.iloc[:2,:1]

| | population |
|---|---:|
| South East | 9180135 |
| London | 8961989 |


In [54]:
England_regions.iloc[2:3,:2]

| | population | new_covid_cases |
|---|---:|---:|
| North West | 7341196 | 98565 |


### Important convention

The `.iloc` indexer gives us precise control over the rows and columns that we subset from a DataFrame

**1** Access dataset by index

In [56]:
England_regions.iloc[1:2]

| | population | new_covid_cases |
|---|---:|---:|
| London | 8961989 | 53804 |


**2** Access dataset by slicing
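The notebook gives no cell for this second case, so the following is only a sketch of what slicing with `iloc` can look like. It rebuilds `England_regions` from the values shown earlier; the result variable names are illustrative.

```python
import pandas as pd

England_regions = pd.DataFrame(
    {
        "population": [9180135, 8961989, 7341196, 6236072],
        "new_covid_cases": [45496, 53804, 98565, 32760],
    },
    index=["South East", "London", "North West", "East of England"],
)

first_two_rows = England_regions.iloc[:2]        # rows 0 and 1, all columns
last_row_cases = England_regions.iloc[-1:, 1:]   # last row, second column onwards
every_other = England_regions.iloc[::2, 0]       # rows 0 and 2 of the first column

# The same kind of selection by label uses loc, where the end label is included
same_by_label = England_regions.loc["South East":"London", "population"]

print(first_two_rows)
print(last_row_cases)
print(every_other)
print(same_by_label)
```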

## 3. Combine datasets using concatenate and append functions

We can use the `pd.concat()` function to concatenate two datasets

In [63]:
pop_list2 = {'CityA':91801,'CityB':89619,'CityC':73411,'CityD':62360}

In [64]:
polution  ={'CityA':45496,'CityB':53804,'CityC':98565,'CityD':32760}

In [65]:
Indic01 = pd.DataFrame({'population':pop_list2,'polution':polution})

In [66]:
Indic01

| | population | polution |
|---|---:|---:|
| CityA | 91801 | 45496 |
| CityB | 89619 | 53804 |
| CityC | 73411 | 98565 |
| CityD | 62360 | 32760 |


In [67]:
pop_list3 = {'CityA':91801,'CityB':89619,'CityC':73411,'CityD':62360}

In [68]:
polution2  ={'CityA':45496,'CityB':53804,'CityC':98565,'CityD':32760}

In [69]:
Indic02 = pd.DataFrame({'population':pop_list3,'polution':polution2})

In [70]:
Indic02

| | population | polution |
|---|---:|---:|
| CityA | 91801 | 45496 |
| CityB | 89619 | 53804 |
| CityC | 73411 | 98565 |
| CityD | 62360 | 32760 |


Then we concatenate those two DataFrames

In [72]:
Alldata = pd.concat([Indic01,Indic02])

In [73]:
Alldata

| | population | polution |
|---|---:|---:|
| CityA | 91801 | 45496 |
| CityB | 89619 | 53804 |
| CityC | 73411 | 98565 |
| CityD | 62360 | 32760 |
| CityA | 91801 | 45496 |
| CityB | 89619 | 53804 |
| CityC | 73411 | 98565 |
| CityD | 62360 | 32760 |
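Notice that the row labels repeat in the concatenated result. The sketch below uses a shortened two-city version of the data to show two options that `pd.concat()` offers for this: `ignore_index=True` to renumber the rows, and `keys=` to tag each block with its source. The key names `first` and `second` are illustrative.

```python
import pandas as pd

Indic01 = pd.DataFrame(
    {
        "population": {"CityA": 91801, "CityB": 89619},
        "polution": {"CityA": 45496, "CityB": 53804},
    }
)
Indic02 = Indic01.copy()

# Plain concatenation keeps the original labels, so they appear twice
stacked = pd.concat([Indic01, Indic02])

# ignore_index=True discards the labels and numbers the rows 0..n-1
renumbered = pd.concat([Indic01, Indic02], ignore_index=True)

# keys= adds an outer index level recording which DataFrame each row came from
tagged = pd.concat([Indic01, Indic02], keys=["first", "second"])

print(stacked.index.is_unique)
print(renumbered)
print(tagged)
```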


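Now we can try to **append** them using the `append()` method. Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use `pd.concat()` as shown above instead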

In [75]:
Alldataappend = Indic01.append(Indic02)

In [76]:
Alldataappend

| | population | polution |
|---|---:|---:|
| CityA | 91801 | 45496 |
| CityB | 89619 | 53804 |
| CityC | 73411 | 98565 |
| CityD | 62360 | 32760 |
| CityA | 91801 | 45496 |
| CityB | 89619 | 53804 |
| CityC | 73411 | 98565 |
| CityD | 62360 | 32760 |


In [None]:
# General pattern: pass a list of DataFrames to pd.concat() (df1 and df2 are created in section 4)
Concatdt = pd.concat([df1,df2])

## 4. Merge datasets

We use the `pd.merge()` function for a standard one-to-one merge in Pandas. When no key is specified, it joins on the columns that the two DataFrames have in common (here, `name`)

In [100]:
df1 = pd.DataFrame({'name':['Andrew','John','Matthew'],
                   'gorup':['Accounting','Engineering','Sales']})

In [101]:
df1

| | name | gorup |
|---|---|---|
| 0 | Andrew | Accounting |
| 1 | John | Engineering |
| 2 | Matthew | Sales |


In [102]:
df2 = pd.DataFrame({'name':['Andrew','John','Matthew'],
                   'hire_date':[2004,2005,2012]})

In [103]:
df2

| | name | hire_date |
|---|---|---:|
| 0 | Andrew | 2004 |
| 1 | John | 2005 |
| 2 | Matthew | 2012 |


In [104]:
df4 = pd.merge(df1,df2)

In [105]:
df4

| | name | gorup | hire_date |
|---|---|---|---:|
| 0 | Andrew | Accounting | 2004 |
| 1 | John | Engineering | 2005 |
| 2 | Matthew | Sales | 2012 |
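The merge above works without extra arguments because both DataFrames share every name. As a sketch of what happens when the keys do not fully match, the block below uses hypothetical data (the name `Sue` and her hire date are invented) and passes the key and join type explicitly with the `on` and `how` parameters.

```python
import pandas as pd

left_df = pd.DataFrame(
    {
        "name": ["Andrew", "John", "Matthew"],
        "group": ["Accounting", "Engineering", "Sales"],
    }
)
# Hypothetical variation: Matthew is missing and Sue is extra
right_df = pd.DataFrame(
    {
        "name": ["Andrew", "John", "Sue"],
        "hire_date": [2004, 2005, 2014],
    }
)

inner = pd.merge(left_df, right_df, on="name")               # names present in both
left = pd.merge(left_df, right_df, on="name", how="left")    # every row of left_df
outer = pd.merge(left_df, right_df, on="name", how="outer")  # every name from either side

print(inner)
print(left)
print(outer)
```

Rows with no match on the other side receive missing values (`NaN`) in the columns that come from that side.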


## 5. Pandas read_csv() function

One of the first steps in any data science project is to import data. In Pandas, the `read_csv()` function imports a comma-delimited file into our workspace as a DataFrame; it takes the path to the file as its first argument

In [16]:
passengers_df = pd.read_csv("Air_Traffic_Passenger_Statistics.csv")

In [17]:
passengers_df.head()

| | Activity Period | Operating Airline | Operating Airline IATA Code | Published Airline | Published Airline IATA Code | GEO Summary | GEO Region | Activity Type Code | Price Category Code | Terminal | Boarding Area | Passenger Count |
|---|---:|---|---|---|---|---|---|---|---|---|---|---:|
| 0 | 200507 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Deplaned | Low Fare | Terminal 1 | B | 27271 |
| 1 | 200507 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Enplaned | Low Fare | Terminal 1 | B | 29131 |
| 2 | 200507 | ATA Airlines | TZ | ATA Airlines | TZ | Domestic | US | Thru / Transit | Low Fare | Terminal 1 | B | 5415 |
| 3 | 200507 | Air Canada | AC | Air Canada | AC | International | Canada | Deplaned | Other | Terminal 1 | B | 35156 |
| 4 | 200507 | Air Canada | AC | Air Canada | AC | International | Canada | Enplaned | Other | Terminal 1 | B | 34090 |
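`read_csv()` also accepts optional parameters that control what is loaded. The sketch below reads from an in-memory string instead of the real file so that it runs anywhere; the string holds a few columns of two rows copied from the output above. `usecols` and `dtype` are standard `read_csv()` parameters, while `csv_text` and `sample_df` are illustrative names.

```python
import io

import pandas as pd

# Stand-in for a file on disk: two rows and four columns from the table above
csv_text = """Activity Period,Operating Airline,GEO Summary,Passenger Count
200507,ATA Airlines,Domestic,27271
200507,Air Canada,International,35156
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Operating Airline", "GEO Summary", "Passenger Count"],  # keep only these columns
    dtype={"Passenger Count": "int64"},                               # set the column type explicitly
)

print(sample_df.shape)
print(sample_df.dtypes)
print(sample_df)
```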


#### List the contents of our working directory

In [15]:
ls

 Volume in drive C has no label.
 Volume Serial Number is F462-25D4

 Directory of C:\Users\Pablo.Leonrodenas\Python\000_Getting_started

07/10/2020  17:58    <DIR>          .
07/10/2020  17:58    <DIR>          ..
07/10/2020  17:00    <DIR>          .ipynb_checkpoints
29/09/2020  15:17             7,542 001_Part_1.ipynb
25/09/2020  15:47            18,413 004_Part_2.ipynb
06/10/2020  17:27            60,162 005_Part_3_Numpy.ipynb
07/10/2020  17:53             7,591 008_Pandas.ipynb
07/10/2020  17:56         2,321,760 Air_Traffic_Passenger_Statistics.csv
               5 File(s)      2,415,468 bytes
               3 Dir(s)  123,273,814,016 bytes free
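The `ls` command above is run by the notebook and its output depends on the operating system. As a portable alternative, here is a small sketch using the standard library `pathlib` module to list the working directory from Python; the variable names are illustrative.

```python
from pathlib import Path

# Folder the notebook is currently running in
cwd = Path.cwd()

# Names of everything in that folder, sorted alphabetically
entries = sorted(p.name for p in cwd.iterdir())

# Only the CSV files, which are candidates for pd.read_csv()
csv_files = sorted(p.name for p in cwd.glob("*.csv"))

print(cwd)
print(entries)
print(csv_files)
```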
