# How to handle data in Python using pandas
Pandas is like Excel, but does not suck. Almost anything you need to do with tabular data, you can do with pandas. I'm going to focus on the 20% of the features that do 80% of the work.

In [None]:
import pandas as pd
import numpy as np

# Ignore these next two lines
from IPython.display import Image
pd.set_option('display.max_rows', 5)

# Importing data
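* If the data is separated by `;`, change to `sep=';'`
* Shift+Tab is your best friend: in Jupyter, pressing it inside a function call's parentheses shows the signature and docstring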

In [None]:
df = pd.read_csv('data/listings.csv', sep=',')
df.head()
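
To see what `sep` actually does without needing a file on disk, here is a minimal sketch that reads semicolon-separated text from memory. The data and the name `demo` are made up for illustration; `io.StringIO` just stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data standing in for a file on disk.
raw = io.StringIO("id;name;price\n1;Cozy loft;120\n2;Sunny studio;95\n")

# sep=';' tells read_csv where one column ends and the next begins.
demo = pd.read_csv(raw, sep=';')
print(demo.columns.tolist())
```

If you forget `sep=';'` on a file like this, you typically end up with a single column containing the whole line, which is a quick way to spot the problem.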

# Basic Syntax
* Select one column
* Select multiple columns
   * Must use a list

In [None]:
# SELECT ONE COLUMN
df['id']

In [None]:
# SELECT MULTIPLE COLUMNS
df[ ['id', 'name'] ]

### Selecting rows.

In [None]:
df.iloc[0]

In [None]:
# .loc selects by index label; with the default index, labels match row positions
rows = [0, 10, 100]
df.loc[rows]

### Select row and column

In [None]:
df.at[ 0, 'name' ]
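
The difference between `.iloc`, `.loc` and `.at` is easier to see when the index labels are not simply 0, 1, 2. The sketch below uses a made-up three-row DataFrame (the name `demo` and its values are assumptions, not part of the listings data).

```python
import pandas as pd

# Hypothetical toy data; the index labels are deliberately not 0, 1, 2.
demo = pd.DataFrame(
    {'name': ['Loft', 'Studio', 'Townhouse'], 'price': [120, 95, 300]},
    index=[10, 20, 30],
)

first_by_position = demo.iloc[0]    # first row, whatever its label is
row_by_label = demo.loc[20]         # the row whose index label is 20
single_value = demo.at[30, 'name']  # one cell, by row label and column name

print(first_by_position['name'], row_by_label['price'], single_value)
```

With the default index that `read_csv` gives you, positions and labels happen to be the same numbers, which is why the two look interchangeable in the cells above.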

# Basic Math
Let's say we wanted to increase the price 100x

In [None]:
df['new_price'] = df['price'] * 100

In [None]:
df.price.sum()

In [None]:
df.price.mean()

Let's find the minimum booking amount by multiplying price by minimum_nights
* Note: you can also access columns by using a dot `.`

In [None]:
df['min_booking_amount'] = df.price * df.minimum_nights
df.head()

# Selecting / Filtering data

In [None]:
# Create our select condition.
select_condition = df['neighbourhood_group'] == 'Manhattan'

# Print the dataframe that meets our condition. 
df[ select_condition ]

In [None]:
# Select Brooklyn
condition_1 = df['neighbourhood_group'] == 'Brooklyn'

# Select Prices that are higher than 500
condition_2 = df['price'] > 500

# Select all data that meets both requirements
df[condition_1 & condition_2]

## Using 'or' statements
* What if we want to find all listings in Midtown or DUMBO?
* In pandas, the element-wise 'or' is the pipe `|` (Python's own `or` keyword does not work on columns)

In [None]:
condition_1 = df.neighbourhood == 'Midtown'

condition_2 = df.neighbourhood == 'DUMBO'

# Select all data that meets condition 1 or condition 2
df[ condition_1 | condition_2]
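
If you would rather write the conditions inline instead of storing them in variables first, each one needs its own parentheses. A small sketch with made-up data (the name `demo` and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data for illustration.
demo = pd.DataFrame({
    'neighbourhood': ['Midtown', 'DUMBO', 'Harlem', 'Midtown'],
    'price': [250, 600, 80, 700],
})

# Parentheses are needed because | and & bind more tightly than == and >.
midtown_or_dumbo = demo[(demo['neighbourhood'] == 'Midtown') | (demo['neighbourhood'] == 'DUMBO')]
pricey_midtown = demo[(demo['neighbourhood'] == 'Midtown') & (demo['price'] > 500)]

print(len(midtown_or_dumbo), pricey_midtown['price'].tolist())
```

Storing each condition in its own variable, as in the cells above, sidesteps the parentheses issue entirely.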

## Using `.isin`. A very handy selector tool.

In [None]:
# Here is a list of ids we want to select
list_of_host_ids = [19303369, 29871437, 63953718, 32084117]

# Make our select condition 
condition = df.host_id.isin(list_of_host_ids)

# Make the selection
df[condition]

In [None]:
# Here is a list of ids we want to select
list_of_host_ids = [19303369, 29871437, 63953718, 32084117]

# Make our select condition 
condition = df.host_id.isin(list_of_host_ids)

# Using `~` selects the inverse of the condition
df[~condition]

In [None]:
list_of_neighbourhoods = ['DUMBO', 'Midtown']

condition = df.neighbourhood.isin(list_of_neighbourhoods)

df[condition]
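
A quick way to convince yourself what `.isin` and `~` do is to check that the two selections split the data cleanly in two. This sketch uses made-up host ids (the name `demo` and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data for illustration.
demo = pd.DataFrame({'host_id': [1, 2, 3, 4, 5], 'price': [100, 80, 250, 60, 120]})

wanted_ids = [2, 4]
condition = demo['host_id'].isin(wanted_ids)

selected = demo[condition]          # rows whose host_id is in the list
everything_else = demo[~condition]  # rows whose host_id is not in the list

print(selected['host_id'].tolist(), everything_else['host_id'].tolist())
```

Every row lands in exactly one of the two results, so their lengths add up to the length of the original DataFrame.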

# Assigning values to filtered columns using `np.where()`
* np.where is a handy tool that takes in a condition statement, followed by what value to set when condition is true, and then what value to set when condition is false.  
* `np.where(condition, when_true, when_false)` 

### We want a new column that equals 1 whenever a listing is in Williamsburg, and 0 otherwise.

In [None]:
c1 = df['neighbourhood'] == 'Williamsburg'

df['is_in_williamsburg'] = np.where(c1, 1, 0)

df.head()

In [None]:
# You can do it with multiple conditions as well...

# First select condition.
c1 = df['neighbourhood'] == 'Williamsburg'

# Second select condition.
c2 = df['room_type'] == 'Private room'

# Set equal to one when both conditions are true, and zero when not true.
df['private_room_in_williamsburg'] =  np.where( c1 & c2, 1, 0 )

# Sanity check to view if our assignment worked correctly.
df[df['private_room_in_williamsburg'] == 1]
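
The values handed to `np.where` do not have to be 1 and 0. As a rough sketch with made-up prices (the name `demo`, the 100 cutoff, and the labels are all assumptions), here it fills a column with text labels instead:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
demo = pd.DataFrame({'price': [80, 250, 100]})

# Label each row depending on whether the condition is true or false.
is_expensive = demo['price'] > 100
demo['price_label'] = np.where(is_expensive, 'expensive', 'cheap')

print(demo['price_label'].tolist())
```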

# Applying a function to a column

In [None]:
def my_function(x):
    if 'Furnished' in str(x):
        return 1
    else:
        return 0

df['is_furnished'] = df.name.apply(my_function)

df.head()
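
For simple text checks like this one, pandas also has built-in string methods that avoid writing a function at all. The sketch below compares `.apply` with `.str.contains` on made-up names (the name `demo` and its values are assumptions); it is one possible alternative, not a replacement for `.apply` in general.

```python
import pandas as pd

# Hypothetical toy data, including a missing name.
demo = pd.DataFrame({'name': ['Furnished loft', 'Sunny studio', None]})

def my_function(x):
    # str(x) guards against missing values, which are not strings.
    return 1 if 'Furnished' in str(x) else 0

via_apply = demo['name'].apply(my_function)

# na=False treats missing names as "does not contain";
# regex=False matches the text literally.
via_str = demo['name'].str.contains('Furnished', na=False, regex=False).astype(int)

print(via_apply.tolist(), via_str.tolist())
```

Reach for `.apply` when the logic is too custom for the built-in `.str` methods.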

# Grouping

In [None]:
Image('https://i.stack.imgur.com/sgCn1.jpg')

Doing a groupby on its own only creates a groupby object. Nothing is computed until you apply a function to a column of that object.

In [None]:
df.groupby('neighbourhood_group')

## After the groupby, select a column then select the function you want to perform on said column.

In [None]:
df.groupby('neighbourhood_group')['price'].mean()

# You can use `.agg` to apply multiple functions to said column.

In [None]:
df.groupby('neighbourhood_group')['price'].agg(['count', 'min', 'max', 'mean', 'median', 'std', 'sum'])
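
To make the shape of the `.agg` result concrete, here is a small sketch on made-up prices (the name `demo` and its values are assumptions). You get one row per group and one column per function.

```python
import pandas as pd

# Hypothetical toy data for illustration.
demo = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Brooklyn'],
    'price': [200, 100, 50, 150, 100],
})

# One row per group, one column per aggregation function.
summary = demo.groupby('neighbourhood_group')['price'].agg(['count', 'mean', 'max'])
print(summary)
```

The group names become the index of the result, so a single value can be looked up with `summary.loc['Brooklyn', 'mean']`.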

# Double groupby

In [None]:
groupby_cols = ['neighbourhood_group', 'neighbourhood']
df.groupby(groupby_cols)['price'].mean()

# Making a DataFrame out of a groupby 
* This is somewhat nuanced, but it is something I find people struggle with, and it is very handy to know how to do.

In [None]:
# Do your group by
gb = df.groupby('neighbourhood_group')['price'].mean()

# Convert it to a DataFrame
new_df = pd.DataFrame(gb)

# Reset the index
new_df = new_df.reset_index()

# Check it out
new_df
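
The same result can be reached in fewer steps. The sketch below shows two shorter spellings on made-up data (the name `demo` and its values are assumptions): calling `.reset_index()` directly on the groupby result, or passing `as_index=False` to `groupby`.

```python
import pandas as pd

# Hypothetical toy data for illustration.
demo = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Brooklyn'],
    'price': [200, 100, 50, 150, 100],
})

# Option 1: reset_index() turns the grouped Series straight into a DataFrame.
by_group = demo.groupby('neighbourhood_group')['price'].mean().reset_index()

# Option 2: as_index=False keeps the group key as a regular column from the start.
same_thing = demo.groupby('neighbourhood_group', as_index=False)['price'].mean()

print(by_group)
```

Which one you use is mostly a matter of taste; the step-by-step version above is easier to follow when you are starting out.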

# Very helpful tools
* `df.describe()`
* `df.info()`
* `df.shape`
* `df['column'].value_counts()`
* `df['column'].apply(function)`
* `df.dropna(subset=[columns])`

In [None]:
df.describe()

In [None]:
df.info()

# How to find percentages of the whole using `value_counts`

In [None]:
df.neighbourhood_group.value_counts()

In [None]:
dfp = df.neighbourhood_group.value_counts() / df.neighbourhood_group.value_counts().sum()
dfp
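
`value_counts` can also do the division for you through its `normalize` parameter. A minimal sketch on made-up data (the name `groups` and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data for illustration.
groups = pd.Series(['Manhattan', 'Manhattan', 'Brooklyn', 'Queens'])

# normalize=True returns each count as a fraction of the total.
shares = groups.value_counts(normalize=True)

# The manual version, for comparison.
manual = groups.value_counts() / groups.value_counts().sum()

print(shares)
```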

# Joins

In [None]:
Image('https://i.stack.imgur.com/UI25E.jpg')

In [None]:
dfprices = pd.read_csv('data/prices.csv')
dfprices

In [None]:
dflistings = pd.read_csv('data/n_listings.csv')
dflistings

In [None]:
# Merge defaults to an inner-join
pd.merge(dfprices, dflistings, on='neighbourhood_group')

In [None]:
# Doing a left join keeps all values in the left table
dfjoined = pd.merge(dfprices, dflistings, on='neighbourhood_group', how='left')
dfjoined
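
The difference between the two joins shows up when a key exists in only one table. In this sketch both tables and all their numbers are made up for illustration (they are not the contents of `prices.csv` or `n_listings.csv`): 'Bronx' appears only on the left and 'Queens' only on the right.

```python
import pandas as pd

# Hypothetical toy tables; the values are invented.
prices = pd.DataFrame({
    'neighbourhood_group': ['Bronx', 'Brooklyn', 'Manhattan'],
    'mean_price': [90, 120, 200],
})
counts = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Queens'],
    'n_listings': [20, 21, 5],
})

# Inner join: only keys present in both tables survive.
inner = pd.merge(prices, counts, on='neighbourhood_group')

# Left join: every row of the left table survives; missing matches become NaN.
left = pd.merge(prices, counts, on='neighbourhood_group', how='left')

print(inner)
print(left)
```

In the left join, 'Bronx' is kept with a missing `n_listings`, while 'Queens' is dropped in both results because it is not in the left table.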

# Exporting data
* Almost always use `index=False` when saving your csv

In [None]:
# This is the name of your new file
save_as = 'my-data-file.csv'

# This is the method you use, DON'T FORGET INDEX=FALSE!
df.to_csv(save_as, index=False)
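
Here is why `index=False` matters. The sketch below writes a made-up DataFrame to CSV text in memory (calling `to_csv` without a path returns the CSV as a string) and reads it back, with and without the index. The name `demo` and its values are assumptions.

```python
import io
import pandas as pd

# Hypothetical toy data for illustration.
demo = pd.DataFrame({'id': [1, 2], 'price': [120, 95]})

# Without index=False, the row index is written as an extra unnamed column.
text_with_index = demo.to_csv()
cols_with_index = pd.read_csv(io.StringIO(text_with_index)).columns.tolist()

# With index=False, only the real columns are written.
text_without_index = demo.to_csv(index=False)
cols_without_index = pd.read_csv(io.StringIO(text_without_index)).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

The extra column comes back named `Unnamed: 0`, which is the usual sign that a file was saved with its index.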

In [None]:
# Another fun way to get your data out.
df.to_clipboard(index=False)

# Now open up the Exercise notebook and begin your journey into Pandas.