# Pandas

Pandas is a Python module for working with tabular data (i.e., data in a table with rows and columns). You can import the data directly from a file (CSV, Excel spreadsheet) or an SQL query. Pandas creates a `DataFrame` object to hold the data. A DataFrame has rows and columns: each column has a name (usually a string) and each row has an index label (by default an integer starting at 0). The actual values can be strings, integers, floats, tuples, etc.

#### Create a DataFrame using a Dictionary

You can create a `DataFrame` with a dictionary using `pd.DataFrame()`. Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error.

```py
import pandas as pd

# each key is a column name, each list holds that column's values
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})
```

```txt
         name         address  age
0  John Smith    123 Main St.   34
1    Jane Doe  456 Maple Ave.   28
2   Joe Schmo    789 Broadway   51
```

```py
import pandas as pd

df1 = pd.DataFrame({
  'Product ID': [1, 2, 3, 4],
  'Product Name': ['t-shirt', 't-shirt', 'skirt', 'skirt'],
  'Color': ['blue', 'green', 'red', 'black']
})

print(df1)
```

```txt
   Product ID Product Name  Color
0           1      t-shirt   blue
1           2      t-shirt  green
2           3        skirt    red
3           4        skirt  black
```


#### Create a DataFrame using a List

You can also create a `DataFrame` from a list of lists. Each inner list represents a row. Use the keyword `columns` to set the column names. The order of the column names must match the order in which the values appear in each inner list (row).

```py
df2 = pd.DataFrame([
  [1, 'San Diego', 100],
  [2, 'Los Angeles', 120],
  [3, 'San Francisco', 90],
  [4, 'Sacramento', 115]
],
  columns=[
    'Store ID', 'Location', 'Number of Employees'
  ])

print(df2)
```

```txt
   Store ID       Location  Number of Employees
0         1      San Diego                  100
1         2    Los Angeles                  120
2         3  San Francisco                   90
3         4     Sacramento                  115
```


#### Create a DataFrame using a File

Most of the time we'll be importing CSV files (text files of comma-separated values). They can be obtained from online data sets, exports of Excel or Google Sheets, or exports from SQL databases.

The first row of a CSV contains the column headings. All subsequent rows contain values. Each column heading and each value is separated by a comma (NO spaces following the commas):

```txt
column1,column2,column3
value1,value2,value3
```

```txt
name,cake_flavor,frosting_flavor,topping
Devil's Food,chocolate,chocolate,chocolate shavings
Birthday Cake,vanilla,vanilla,rainbow sprinkles
Carrot cake,carrot,cream cheese,almonds
```

To load a CSV into a `DataFrame`, call `pd.read_csv()` and pass the path to the CSV file as an argument.
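
As a minimal sketch, the cake data shown above can be loaded like this. To keep the example self-contained it reads the CSV text from an in-memory `io.StringIO` buffer rather than from disk; with a real file you would pass its path instead (the file name `cakes.csv` in the comment is hypothetical).

```py
import io
import pandas as pd

# CSV text from the example above; StringIO stands in for a file on disk
csv_text = """name,cake_flavor,frosting_flavor,topping
Devil's Food,chocolate,chocolate,chocolate shavings
Birthday Cake,vanilla,vanilla,rainbow sprinkles
Carrot cake,carrot,cream cheese,almonds
"""

# With a real file: cakes = pd.read_csv('cakes.csv')  # hypothetical file name
cakes = pd.read_csv(io.StringIO(csv_text))

# The first CSV row became the column names; the rest became rows
print(cakes.columns.tolist())
print(len(cakes))
```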

To write the data in a `DataFrame` to a CSV file, call the `.to_csv()` method on the `DataFrame` object and pass the file name as an argument. A bare file name saves the file to the current working directory. By default the row index is written as the first column; pass `index=False` to leave it out.

```py
df2.to_csv('new-csv-file.csv')
```

#### Inspecting a DataFrame

The `head()` method returns the first 5 rows by default (the column headings are shown above them). Pass an integer argument to fetch that number of rows instead, e.g. `df.head(10)`.

To view information about the dataset, such as the number and names of columns, data types, and memory usage, use `df.info()`.
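
A short sketch of both inspection methods. The DataFrame here is made up purely for illustration; any DataFrame would work the same way.

```py
import pandas as pd

# Small illustrative DataFrame (values are made up for this sketch)
df = pd.DataFrame({
    'id': list(range(1, 9)),
    'score': [88.5, 92.0, 79.5, 65.0, 90.0, 71.5, 84.0, 95.5]
})

first_five = df.head()    # no argument: first 5 rows
first_three = df.head(3)  # first 3 rows

print(first_five)
print(first_three)

# Prints a summary: index range, column names, non-null counts, dtypes, memory usage
df.info()
```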

#### Selecting a single column

To select a single column of values (a `Series`), use the `df['column name']` format. If the column name follows the rules for naming Python variables (e.g. it doesn't start with a number or contain spaces or special characters), you can also use dot notation, e.g. `df.my_column_name`.

```py
df3 = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

clinic_north = df3.clinic_north
print(clinic_north)
```

```txt
0    100
1     45
2     96
3     80
4     54
5    109
Name: clinic_north, dtype: int64
```

```py
print(type(clinic_north))
```

```txt
<class 'pandas.core.series.Series'>
```

```py
print(type(df3))
```

```txt
<class 'pandas.core.frame.DataFrame'>
```
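
When a column name contains spaces, dot notation is not available and bracket notation is the only option. A small sketch, reusing two rows of the store data from the list example (the variable name `stores` is chosen here for illustration):

```py
import pandas as pd

stores = pd.DataFrame([
    [1, 'San Diego', 100],
    [2, 'Los Angeles', 120]
], columns=['Store ID', 'Location', 'Number of Employees'])

# 'Location' is a valid Python identifier, so both forms work
locations = stores.Location
same_locations = stores['Location']

# 'Number of Employees' contains spaces, so only bracket notation works
employees = stores['Number of Employees']
print(employees)
```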


#### Selecting multiple columns

To select two or more columns from a DataFrame, pass a list of column names inside the square brackets. The result is a new DataFrame.

```py
new_df = orders[['last_name', 'email']]
```

Note: you need a double set of square brackets, e.g. `df[['col1', 'col2']]`. The outer pair selects from the DataFrame and the inner pair is the list of column names.

```py
clinic_north_south = df3[['clinic_north', 'clinic_south']]
print(clinic_north_south)
```

```txt
   clinic_north  clinic_south
0           100            23
1            45           145
2            96            65
3            80            54
4            54            54
5           109            79
```

```py
print(type(clinic_north_south))
```

```txt
<class 'pandas.core.frame.DataFrame'>
```


#### Selecting a single row

DataFrames are zero-indexed. You can fetch a single row by passing its integer position to `iloc[]`. The result is a Pandas Series.

```py
march = df3.iloc[2]
print(march)
```

```txt
month           March
clinic_east        81
clinic_north       96
clinic_south       65
clinic_west        96
Name: 2, dtype: object
```

```py
print(type(march))
```

```txt
<class 'pandas.core.series.Series'>
```


#### Selecting multiple rows

Use the same slicing syntax you would use to select inner lists from a Python 2D list. Positions are counted from zero:

`df.iloc[3:7]` selects the rows at positions 3, 4, 5, and 6 (the end position, 7, is not included)

`df.iloc[:4]` selects all rows up to, but not including, position 4 (i.e., positions 0, 1, 2, and 3)

`df.iloc[-3:]` selects the last three rows, from the 3rd-to-last row up to and including the final row

```py
april_may_june = df3.iloc[3:6]
print(april_may_june)
```

```txt
   month  clinic_east  clinic_north  clinic_south  clinic_west
3  April           80            80            54          180
4    May           51            54            54          154
5   June          112           109            79          129
```

```py
print(type(april_may_june))
```

```txt
<class 'pandas.core.frame.DataFrame'>
```
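
The open-ended and negative slices described above can be sketched the same way. The two-column DataFrame below is a cut-down, illustrative version of the clinic data (the column name `visits` is an assumption made for this example):

```py
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'visits': [100, 45, 96, 80, 54, 109]
})

first_four = df.iloc[:4]   # positions 0, 1, 2, 3
middle = df.iloc[2:5]      # positions 2, 3, 4 (5 is excluded)
last_three = df.iloc[-3:]  # the final three rows

print(first_four)
print(middle)
print(last_three)
```

Note that the selected rows keep their original index labels, so `last_three` is still labelled 3, 4, 5.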


#### Selecting a subset of data using Logic

```py
# general form: select all rows where a column equals a given value
df[df.MyColumnName == desired_column_value]

# select all rows where age is exactly 30
df[df.age == 30]

# select all rows that meet a particular condition
df[df.age < 30]

# select all rows that do NOT meet a condition
df[df.name != 'Clara Oswald']
```

You can also combine multiple logical statements with `|` (or) and `&` (and), as long as each statement is wrapped in parentheses.

```py
# select all rows where the customer's age was under 30 or the customer's name was "Martha Jones"
df[(df.age < 30) | (df.name == 'Martha Jones')]
```

```py
march_april = df3[(df3.month == 'March') | (df3.month == 'April')]
print(march_april)
```

```txt
   month  clinic_east  clinic_north  clinic_south  clinic_west
2  March           81            96            65           96
3  April           80            80            54          180
```
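
Where `|` keeps rows that satisfy either condition, `&` keeps only rows that satisfy both. A brief sketch using a subset of the clinic columns (the thresholds 75 and 100 are arbitrary values picked for illustration):

```py
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'clinic_east': [100, 51, 81, 80, 51, 112],
    'clinic_west': [100, 45, 96, 180, 154, 129]
})

# Keep rows where BOTH conditions hold; each condition is in its own parentheses
busy_both = df[(df.clinic_east > 75) & (df.clinic_west > 100)]
print(busy_both)
```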


You can also use the `isin()` method to check whether a column's value appears in a list of values and return the matching rows, e.g. select the rows where the customer's name is either "Martha Jones", "Rose Tyler" or "Amy Pond":

```py
df[df.name.isin(['Martha Jones',
     'Rose Tyler',
     'Amy Pond'])]
```

```py
january_february_march = df3[df3.month.isin(['January', 'February', 'March'])]
print(january_february_march)
```

```txt
      month  clinic_east  clinic_north  clinic_south  clinic_west
0   January          100           100            23          100
1  February           51            45           145           45
2     March           81            96            65           96
```

```py
print(type(january_february_march))
```

```txt
<class 'pandas.core.frame.DataFrame'>
```


#### Resetting DataFrame Indices

When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. We can fix this using the method `.reset_index()`.

```py
print(march_april.reset_index())
```

```txt
   index  month  clinic_east  clinic_north  clinic_south  clinic_west
0      2  March           81            96            65           96
1      3  April           80            80            54          180
```


By default, the old indices are kept in a new `index` column and the rows are renumbered from 0. You can avoid creating the `index` column by using the `drop=True` option.

```py
print(march_april.reset_index(drop=True))
```

```txt
   month  clinic_east  clinic_north  clinic_south  clinic_west
0  March           81            96            65           96
1  April           80            80            54          180
```


`.reset_index()` returns a new `DataFrame` and leaves the original unchanged. To modify the existing DataFrame instead, use the `inplace=True` option.

```py
df4 = df3.loc[[1, 3, 5]]
print(df4)
```

```txt
      month  clinic_east  clinic_north  clinic_south  clinic_west
1  February           51            45           145           45
3     April           80            80            54          180
5      June          112           109            79          129
```

```py
df5 = df4.reset_index()
print(df5)
```

```txt
   index     month  clinic_east  clinic_north  clinic_south  clinic_west
0      1  February           51            45           145           45
1      3     April           80            80            54          180
2      5      June          112           109            79          129
```

```py
print(df4)
```

```txt
      month  clinic_east  clinic_north  clinic_south  clinic_west
1  February           51            45           145           45
3     April           80            80            54          180
5      June          112           109            79          129
```

```py
df4.reset_index(drop=True, inplace=True)
print(df4)
```

```txt
      month  clinic_east  clinic_north  clinic_south  clinic_west
0  February           51            45           145           45
1     April           80            80            54          180
2      June          112           109            79          129
```


#### Example:

```py
import pandas as pd
orders = pd.read_csv('shoefly.csv')

print(orders.head(20))

# fetch all email addresses
emails = orders.email
print(emails)

# find the matching order
frances_palmer = orders[(orders.first_name == 'Frances') & (orders.last_name == 'Palmer')]
print(frances_palmer)

# select all orders of shoe_type: clogs, boots & ballet flats
comfy_shoes = orders[(orders.shoe_type == 'clogs') | (orders.shoe_type == 'boots') | (orders.shoe_type == 'ballet flats')]
print(comfy_shoes)
```
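
The chained `|` comparison for `comfy_shoes` could also be written with `isin()`, as described earlier. The sketch below checks that both forms select the same rows. Since `shoefly.csv` is not included here, it uses a small made-up `orders` DataFrame as a stand-in; the names and shoe types are hypothetical.

```py
import pandas as pd

# Hypothetical stand-in for the data in shoefly.csv
orders = pd.DataFrame({
    'first_name': ['Frances', 'Joyce', 'Cheryl', 'Jason'],
    'last_name': ['Palmer', 'Waller', 'Beck', 'Nash'],
    'shoe_type': ['clogs', 'sandals', 'boots', 'ballet flats']
})

# Chained comparisons, as in the example above
chained = orders[(orders.shoe_type == 'clogs') |
                 (orders.shoe_type == 'boots') |
                 (orders.shoe_type == 'ballet flats')]

# Same selection expressed with isin()
comfy_shoes = orders[orders.shoe_type.isin(['clogs', 'boots', 'ballet flats'])]

print(comfy_shoes)
print(comfy_shoes.equals(chained))
```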