# Pandas

Pandas is a Python module for **working with tabular data** (i.e., data in a table with rows and columns). 

You can **import data** directly from a **file**, e.g. a CSV file or an Excel spreadsheet, or from an **SQL query**.

Data is imported into a pandas `dataframe` object: 

- dataframes have rows and columns.
- each column has a name (usually a string) and each row has an index label (an integer by default).
- the actual values can be strings, integers, floats, tuples, etc.

### Create a DataFrame from a Dictionary

You can create a `dataframe` with a dictionary using `pd.DataFrame()`. 

- each key becomes a column name.
- each value is a Python list, which becomes the column's values.
- the columns must all be of the same length or an error is raised.

```py
import pandas as pd

df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})
```

```py
name	address	age
John Smith	123 Main St.	34
Jane Doe	456 Maple Ave.	28
Joe Schmo	789 Broadway	51
```
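As a small illustration of the equal-length rule above, the sketch below passes a dictionary whose lists have different lengths and catches the resulting error. The data is made up for the example.

```python
import pandas as pd

# Hypothetical data: 'age' has only two values while 'name' has three.
mismatched = {
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'age': [34, 28],
}

try:
    pd.DataFrame(mismatched)
    error_raised = False
except ValueError as err:
    # pandas refuses to build a DataFrame from columns of unequal length
    error_raised = True
    print(f'Could not create the DataFrame: {err}')
```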

In [1]:
```py
import pandas as pd

df_dict = pd.DataFrame({
  'Product ID': [1, 2, 3, 4],
  'Product Name': ['t-shirt', 't-shirt', 'skirt', 'skirt'],
  'Color': ['blue', 'green', 'red', 'black']
})
df_dict.head()
```

|  | Product ID | Product Name | Color |
|---|---|---|---|
| 0 | 1 | t-shirt | blue |
| 1 | 2 | t-shirt | green |
| 2 | 3 | skirt | red |
| 3 | 4 | skirt | black |


### Create a DataFrame from a List

`pd.DataFrame()` also accepts a nested list (or a 2D NumPy array) of values as its 1st argument and, optionally, a list of strings to use as the column names.

- each inner list represents a row.
- use the keyword `columns` to set the column names. 
- the order of the column names must match the order in which the values appear in each inner list.

General syntax:

```py
df = pd.DataFrame([[values]], columns=[headings])

# Alternatively
df = pd.DataFrame([[values]])
df.columns = ['name1', 'name2']
```

In [2]:
```py
df_list = pd.DataFrame([
  [1, 'San Diego', 100],
  [2, 'Los Angeles', 120],
  [3, 'San Francisco', 90],
  [4, 'Sacramento', 115]
],
  columns=[
    'Store ID', 'Location', 'Number of Employees'
  ])

df_list.head()
```

|  | Store ID | Location | Number of Employees |
|---|---|---|---|
| 0 | 1 | San Diego | 100 |
| 1 | 2 | Los Angeles | 120 |
| 2 | 3 | San Francisco | 90 |
| 3 | 4 | Sacramento | 115 |


In [3]:
```py
df_list2 = pd.DataFrame([
  [1, 'San Diego', 100],
  [2, 'Los Angeles', 120],
  [3, 'San Francisco', 90],
  [4, 'Sacramento', 115]
])
df_list2.head()
```

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | San Diego | 100 |
| 1 | 2 | Los Angeles | 120 |
| 2 | 3 | San Francisco | 90 |
| 3 | 4 | Sacramento | 115 |


In [4]:
```py
df_list2.columns = ['Store ID', 'Location', 'Number of Employees']
df_list2.head()
```

|  | Store ID | Location | Number of Employees |
|---|---|---|---|
| 0 | 1 | San Diego | 100 |
| 1 | 2 | Los Angeles | 120 |
| 2 | 3 | San Francisco | 90 |
| 3 | 4 | Sacramento | 115 |
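The examples above all use nested lists. As a minimal sketch of the 2D NumPy array variant mentioned earlier, the following builds a DataFrame from an array. The figures and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up figures: each row of the array becomes a row of the DataFrame.
values = np.array([
    [1, 100, 250],
    [2, 120, 300],
    [3, 90, 180],
])

df_array = pd.DataFrame(values, columns=['Store ID', 'Employees', 'Daily Sales'])
print(df_array)
```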


By default each dataframe has an **index column** of integer values, starting at `0` and incrementing by `1`, which is used to identify each row. To set the index column on the dataframe and provide a custom identifier, use the `.index` attribute.

In [5]:
```py
df_list.index = ['SD', 'LA', 'SF', 'SA']
df_list.head()
```

|  | Store ID | Location | Number of Employees |
|---|---|---|---|
| SD | 1 | San Diego | 100 |
| LA | 2 | Los Angeles | 120 |
| SF | 3 | San Francisco | 90 |
| SA | 4 | Sacramento | 115 |
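To make the before and after explicit, here is a small self-contained sketch that inspects the default index and then replaces it. The `df_stores` name and its data are only for illustration; note that you need to supply one label per row.

```python
import pandas as pd

df_stores = pd.DataFrame(
    [[1, 'San Diego'], [2, 'Los Angeles'], [3, 'San Francisco']],
    columns=['Store ID', 'Location'],
)

# Default index: integers starting at 0.
default_labels = list(df_stores.index)
print(default_labels)

# Replace it with custom labels (one label per row).
df_stores.index = ['SD', 'LA', 'SF']
custom_labels = list(df_stores.index)
print(custom_labels)
```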


### Create a DataFrame from a File

Most of the time we'll be importing CSV (comma-separated values) text files.

- we can also import files that use other delimiters, e.g. tab or semicolon.
- CSV files can be obtained from online data sets, exports from Excel or Google Sheets, or SQL databases.
- by default pandas assumes the first row of the file contains column headings. All subsequent rows are assumed to be values.
- where no heading row is provided, use the `header=None` and `names=[column names list]` parameters. If you want to replace an existing header row, use `skiprows=1` together with `names`.
- each field (column heading or value) must be separated by a comma (or other delimiter). By default, any spaces following the delimiter are kept as part of the value; pass `skipinitialspace=True` to strip them.
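The header options above can be sketched as follows. To keep the example self-contained it reads from in-memory text via `io.StringIO` rather than from a file on disk; the CSV contents and column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text without a header row.
no_header = io.StringIO('1,San Diego,100\n2,Los Angeles,120\n')
df_no_header = pd.read_csv(no_header, header=None,
                           names=['Store ID', 'Location', 'Employees'])

# Hypothetical CSV text whose header row we want to replace.
with_header = io.StringIO('id,loc,emp\n1,San Diego,100\n2,Los Angeles,120\n')
df_renamed = pd.read_csv(with_header, skiprows=1,
                         names=['Store ID', 'Location', 'Employees'])

print(df_no_header)
print(df_renamed)
```

Both calls should produce the same two-row DataFrame with the supplied column names.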

To load a CSV file into a `DataFrame`, use `pd.read_csv()`. The path to the file is passed as the 1st argument, optionally followed by a separator. By default the `,` separator is used, but you can use `;`, `' '`, etc., as long as you specify it with the `sep` keyword.

```py
df = pd.read_csv('path/to/file', sep=',')
```
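As a quick sketch of a non-default separator, the example below parses semicolon-delimited text. The data is made up, and `io.StringIO` stands in for a file path so the snippet runs on its own.

```python
import io
import pandas as pd

# Made-up semicolon-delimited text standing in for a file on disk.
semicolon_data = io.StringIO('country;capital\nBrazil;Brasilia\nIndia;New Delhi\n')

df_semicolon = pd.read_csv(semicolon_data, sep=';')
print(df_semicolon)
```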

In [6]:
```py
df_brics = pd.read_csv('data/brics.csv')
df_brics.head()
```

|  | Unnamed: 0 | country | capital | area | population |
|---|---|---|---|---|---|
| 0 | BR | Brazil | Brasilia | 8.516 | 200.4 |
| 1 | RU | Russia | Moscow | 17.1 | 143.5 |
| 2 | IN | India | New Delhi | 3.286 | 1252.0 |
| 3 | CH | China | Beijing | 9.597 | 1357.0 |
| 4 | SA | South Africa | Pretoria | 1.221 | 52.98 |


When the CSV file already contains an index column, pandas will treat it as a column in its own right and add a zero-based index column to the dataframe.

To avoid this (or to set one of the columns as the `index` column), use the `index_col` parameter to define which column to use as the `index` column.

In [7]:
```py
df_brics = pd.read_csv('data/brics.csv', index_col=0)
df_brics.head()
```

|  | country | capital | area | population |
|---|---|---|---|---|
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |
| SA | South Africa | Pretoria | 1.221 | 52.98 |


We can export data from a dataframe to CSV using the `.to_csv()` method. The method is called on the DataFrame object and the name of the CSV file is passed as an argument; a bare file name saves the file to the current working directory. Add `index=False` so the index is not exported with the data.

```py
df.to_csv('new-csv-file.csv', index=False)
```
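To see what `index=False` changes, the sketch below calls `to_csv()` without a file name, in which case pandas returns the CSV text as a string instead of writing a file. The `df_small` data is made up for the example.

```python
import pandas as pd

df_small = pd.DataFrame({'country': ['Brazil', 'India'],
                         'capital': ['Brasilia', 'New Delhi']})

# With no path, to_csv() returns the CSV text instead of writing a file.
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

print(with_index)
print(without_index)
```

The first version starts with an empty heading followed by the index values in each row; the second contains only the two data columns.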

### Inspecting a DataFrame

Use the `head()` method to preview a dataframe: by default it returns the first 5 rows (along with the column headings). Pass an integer argument to fetch that number of rows, e.g. `df.head(10)`.
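A minimal sketch of both forms, using a made-up ten-row DataFrame:

```python
import pandas as pd

# Ten rows of made-up data.
df_numbers = pd.DataFrame({'n': range(10), 'squared': [n ** 2 for n in range(10)]})

first_five = df_numbers.head()    # default: first 5 rows
first_three = df_numbers.head(3)  # first 3 rows

print(first_five)
print(first_three)
```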

To view information about the dataset, such as the number of samples (rows), the number and names of columns, the number of non-null values per column (the number of fields that have a value), datatypes and memory usage, use `df.info()`.

In [8]:
```py
df_brics.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, BR to SA
Data columns (total 4 columns):
country       5 non-null object
capital       5 non-null object
area          5 non-null float64
population    5 non-null float64
dtypes: float64(2), object(2)
memory usage: 200.0+ bytes
```


#### Selecting a single column

To select a single column of data values as a `Series`, use the `df['column name']` format. If the column name follows the rules for naming variables (e.g. it doesn't start with a number or contain spaces or special characters) and doesn't clash with an existing DataFrame attribute or method, you can also use dot notation, e.g. `df.my_column_name`.

Selecting a single column of data always returns a `Series`.
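The difference between the two notations can be sketched with a small made-up DataFrame, where one column name is a valid Python identifier and the other contains a space:

```python
import pandas as pd

df_products = pd.DataFrame({
    'Product Name': ['t-shirt', 'skirt'],
    'price': [10, 25],
})

# 'price' is a valid Python identifier, so both forms return the same Series.
same = df_products['price'].equals(df_products.price)

# 'Product Name' contains a space, so only bracket notation can be used.
names = df_products['Product Name']

print(same)
print(type(names))
```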

In [9]:
```py
df_list3 = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)
df_list3.head()
```

|  | month | clinic_east | clinic_north | clinic_south | clinic_west |
|---|---|---|---|---|---|
| 0 | January | 100 | 100 | 23 | 100 |
| 1 | February | 51 | 45 | 145 | 45 |
| 2 | March | 81 | 96 | 65 | 96 |
| 3 | April | 80 | 80 | 54 | 180 |
| 4 | May | 51 | 54 | 54 | 154 |


In [10]:
```py
clinic_north = df_list3.clinic_north
print(type(df_list3)) # data frame
print(type(clinic_north)) # series
print(clinic_north)
```

```
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
0    100
1     45
2     96
3     80
4     54
5    109
Name: clinic_north, dtype: int64
```


#### Selecting multiple columns

To select two or more columns from a dataframe, pass a list of column names inside the square brackets. You can choose any columns, in any order. Selecting 2 or more columns always returns a dataframe.

General syntax:

```py
new_df = df[['col_name1', 'col_name4']]
```

Note: you need to use a double set of square brackets, e.g. `df[[...]]`: the outer pair selects from the dataframe and the inner pair is the list of column names.
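As a rough sketch of how the brackets affect the result, the example below compares single and double brackets on a made-up DataFrame. Even a list containing a single column name returns a DataFrame rather than a Series.

```python
import pandas as pd

df_clinics = pd.DataFrame({
    'month': ['January', 'February'],
    'clinic_north': [100, 45],
    'clinic_south': [23, 145],
})

as_series = df_clinics['clinic_north']             # single brackets: Series
as_frame = df_clinics[['clinic_north']]            # list with one name: DataFrame
reordered = df_clinics[['clinic_south', 'month']]  # any columns, in any order

print(type(as_series))
print(type(as_frame))
print(list(reordered.columns))
```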

In [11]:
```py
clinic_north_south = df_list3[['clinic_north', 'clinic_south']]
print(type(clinic_north_south))
print(clinic_north_south)
```

```
<class 'pandas.core.frame.DataFrame'>
   clinic_north  clinic_south
0           100            23
1            45           145
2            96            65
3            80            54
4            54            54
5           109            79
```


#### Selecting a single row

DataFrame rows are zero-indexed by position. You can fetch a single row by passing its position to `iloc[]`. The result is a pandas `Series`.

In [12]:
```py
march = df_list3.iloc[2]
print(type(march))
print(march)
```

```
<class 'pandas.core.series.Series'>
month           March
clinic_east        81
clinic_north       96
clinic_south       65
clinic_west        96
Name: 2, dtype: object
```
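Because `iloc[]` works by position, it behaves the same way when the dataframe has custom index labels. The sketch below uses a made-up DataFrame and sets the labels with the constructor's `index` keyword, which is not used elsewhere in these notes.

```python
import pandas as pd

df_cities = pd.DataFrame(
    {'Location': ['San Diego', 'Los Angeles', 'San Francisco']},
    index=['SD', 'LA', 'SF'],
)

# iloc counts rows from 0, regardless of the index labels.
third_row = df_cities.iloc[2]

print(third_row['Location'])
print(third_row.name)  # the row's index label
```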


#### Selecting multiple rows

You can use the same `slicing` technique used for Python lists, for selecting multiple rows. 

- the result is always a dataframe.
- all columns are returned.

`df.iloc[3:7]` selects the rows from position 3 up to, but not including, position 7 (i.e., positions 3, 4, 5 and 6, counting from 0).

`df.iloc[:4]` selects all rows up to, but not including, position 4 (i.e., positions 0, 1, 2 and 3).

`df.iloc[-3:]` selects the last three rows: from the 3rd-to-last row up to and including the final row.

To select a range of rows from a subset of columns, chain the two selections: `df[['column1', 'column2']].iloc[3:9]`.

In [13]:
```py
april_may_june = df_list3.iloc[3:6]
print(type(april_may_june))
print(april_may_june)
```

```
<class 'pandas.core.frame.DataFrame'>
   month  clinic_east  clinic_north  clinic_south  clinic_west
3  April           80            80            54          180
4    May           51            54            54          154
5   June          112           109            79          129
```
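The other slicing forms described above can be sketched in the same way. The `df_months` DataFrame and its values are made up for the example.

```python
import pandas as pd

df_months = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'visits': [100, 45, 96, 80, 54, 109],
    'staff': [5, 4, 6, 5, 4, 7],
})

first_four = df_months.iloc[:4]                    # positions 0 to 3
last_three = df_months.iloc[-3:]                   # final three rows
subset = df_months[['month', 'visits']].iloc[1:3]  # two columns, positions 1 and 2

print(first_four['month'].tolist())
print(last_three['month'].tolist())
print(subset)
```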


#### Selecting a subset of rows using Logic

```py
# general form: select the rows where a column equals a given value
df[df.MyColumnName == desired_column_value]

# select the rows where age is exactly 30
df[df.age == 30]

# select all rows that meet a particular condition
df[df.age < 30]

# select all rows that do NOT meet a condition
df[df.name != 'Clara Oswald']
```
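It can help to look at what the expression inside the brackets produces on its own. In the sketch below (made-up names and ages), the comparison yields a boolean `Series` with one value per row, and passing that to the dataframe keeps only the rows marked `True`.

```python
import pandas as pd

df_people = pd.DataFrame({
    'name': ['Clara Oswald', 'Rose Tyler', 'Amy Pond'],
    'age': [27, 30, 31],
})

# The comparison yields one True/False value per row.
mask = df_people.age < 31
print(mask)

# Passing the mask to the DataFrame keeps only the rows marked True.
under_31 = df_people[mask]
print(under_31)
```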

You can also combine multiple logical statements using `&` (and) and `|` (or).

- wrap each statement in parentheses, because `&` and `|` take precedence over comparison operators such as `==` and `>`.

In [14]:
```py
df_list3[(df_list3.month == 'March') | (df_list3.month == 'April')]
```

|  | month | clinic_east | clinic_north | clinic_south | clinic_west |
|---|---|---|---|---|---|
| 2 | March | 81 | 96 | 65 | 96 |
| 3 | April | 80 | 80 | 54 | 180 |


In [15]:
```py
df_list3[(df_list3.clinic_north > 80) & (df_list3.clinic_south > 70)]
```

|  | month | clinic_east | clinic_north | clinic_south | clinic_west |
|---|---|---|---|---|---|
| 5 | June | 112 | 109 | 79 | 129 |


You can also use the `isin()` method to check whether a column's value is one of several options and return the corresponding row(s), e.g. select the rows where the customer's name is either "Martha Jones", "Rose Tyler" or "Amy Pond":

```py
df[df.name.isin(['Martha Jones', 'Rose Tyler', 'Amy Pond'])]
```

In [16]:
```py
combo = df_list3[df_list3.month.isin(['March', 'May', 'April'])]
print(type(combo))
combo
```

```
<class 'pandas.core.frame.DataFrame'>
```

|  | month | clinic_east | clinic_north | clinic_south | clinic_west |
|---|---|---|---|---|---|
| 2 | March | 81 | 96 | 65 | 96 |
| 3 | April | 80 | 80 | 54 | 180 |
| 4 | May | 51 | 54 | 54 | 154 |
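`isin()` is effectively a shorthand for chaining several `==` comparisons with `|`. A minimal sketch on made-up data, checking that both approaches select the same rows:

```python
import pandas as pd

df_visits = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'clinic_north': [100, 45, 96, 80, 54, 109],
})

with_isin = df_visits[df_visits.month.isin(['March', 'April', 'May'])]
with_or = df_visits[(df_visits.month == 'March')
                    | (df_visits.month == 'April')
                    | (df_visits.month == 'May')]

# Both selections contain the same rows, with the same index labels.
print(with_isin.equals(with_or))
```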


#### Resetting DataFrame Indices

When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. We can fix this using the `.reset_index()` method, which returns a new dataframe.

In [17]:
```py
combo = combo.reset_index()
combo
```

|  | index | month | clinic_east | clinic_north | clinic_south | clinic_west |
|---|---|---|---|---|---|---|
| 0 | 2 | March | 81 | 96 | 65 | 96 |
| 1 | 3 | April | 80 | 80 | 54 | 180 |
| 2 | 4 | May | 51 | 54 | 54 | 154 |


By default, the indices are reset and a new `index` column is created containing the old indices. You can avoid the `index` column being created by using the `drop=True` option.

In [18]:
```py
combo = df_list3[df_list3.month.isin(['March', 'May', 'April'])]
combo = combo.reset_index(drop=True)
combo
```

|  | month | clinic_east | clinic_north | clinic_south | clinic_west |
|---|---|---|---|---|---|
| 0 | March | 81 | 96 | 65 | 96 |
| 1 | April | 80 | 80 | 54 | 180 |
| 2 | May | 51 | 54 | 54 | 154 |


`.reset_index()` returns a new `DataFrame`. To modify the existing dataframe instead, use the `inplace=True` option.

In [22]:
```py
selection = df_list3[df_list3.month.isin(['March', 'May', 'April'])]
selection.reset_index(drop=True, inplace=True)
selection
```

|  | month | clinic_east | clinic_north | clinic_south | clinic_west |
|---|---|---|---|---|---|
| 0 | March | 81 | 96 | 65 | 96 |
| 1 | April | 80 | 80 | 54 | 180 |
| 2 | May | 51 | 54 | 54 | 154 |


#### Example:

```py
import pandas as pd
orders = pd.read_csv('shoefly.csv')

print(orders.head(20))

# fetch all email addresses
emails = orders.email
print(emails)

# find the matching order
frances_palmer = orders[(orders.first_name == 'Frances') & (orders.last_name == 'Palmer')]
print(frances_palmer)

# select all orders of shoe_type: clogs, boots & ballet flats
comfy_shoes = orders[(orders.shoe_type == 'clogs') | (orders.shoe_type == 'boots') | (orders.shoe_type == 'ballet flats')]
print(comfy_shoes)
```