# Most Common Pandas Cheatsheet Questions

Let's face it: the Pandas API can be pretty confusing. It sometimes uses `camelCase` instead of `snake_case`, and the names of the functions are often not the easiest to remember. Is it `count_values` or `value_counts` or `values_counts`? I never know, and that's why I end up searching for the same things over and over again.

I decided to make this notebook to put all the common things I keep googling into one place. These are the Stack Overflow questions I have found most useful. Every answer links to the original post, whose authors deserve all the credit.

# Select rows based on column values from Pandas DataFrame

#### Columns that equal value

In [None]:
# some_value is scalar (e.g. a number)
df.loc[df['column_name'] == some_value]

# some_values is iterable (e.g. a list)
df.loc[df['column_name'].isin(some_values)]

# Use & to combine multiple conditions. Note the parentheses!
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

#### Columns that do not equal value

In [None]:
# Use != for rows which do not equal scalar some_value (e.g. a number)
df.loc[df['column_name'] != some_value]

# Use ~ for rows whose value is not in iterable some_values (e.g. a list)
df.loc[~df['column_name'].isin(some_values)]

[Source](https://stackoverflow.com/a/17071908/7595633)
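
The snippets above use placeholder names, so here is a small runnable sketch of the same selections. The DataFrame, its column names (`city`, `temp`) and the values are made up for illustration. The parentheses matter because Python's `&` binds more tightly than `>=` and `<=`.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune", "Oslo"],
    "temp": [3, 19, 31, 5],
})

# Rows where the column equals a scalar
equal = df.loc[df["city"] == "Oslo"]

# Rows where the column is one of several values
in_list = df.loc[df["city"].isin(["Lima", "Pune"])]

# Rows matching two conditions at once (each condition in its own parentheses)
in_range = df.loc[(df["temp"] >= 5) & (df["temp"] <= 20)]

# Rows where the column is NOT one of several values
not_in_list = df.loc[~df["city"].isin(["Lima", "Pune"])]

print(in_range)
```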

# Select multiple columns of Pandas DataFrame

In [None]:
# By name
df1 = df[['a', 'b']] # Note this produces a copy
# By index
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.

[Source](https://stackoverflow.com/a/11287278/7595633)
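
One thing that trips me up: position-based slices with `iloc` exclude the end, but label-based slices with `loc` include it. A minimal sketch with a made-up three-column DataFrame:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

by_name = df[["a", "b"]]         # explicit list of column names
by_position = df.iloc[:, 0:2]    # positions 0 and 1; position 2 is excluded
by_label = df.loc[:, "a":"b"]    # label slice; "b" is included

print(by_label)
```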

# Iterate over rows of Pandas DataFrame

Don't do it! It's not idiomatic. Vectorise your operations instead. See the [full reasoning here](https://stackoverflow.com/a/55557758/7595633).
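
To make "vectorise" concrete, here is a rough sketch comparing a row-by-row loop with the equivalent column arithmetic. The `price` and `quantity` columns are invented for this example; both versions compute the same totals, but the vectorised one is a single expression and avoids a Python-level loop.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Row-by-row version (what you usually want to avoid)
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# Vectorised version: operate on whole columns at once
df["total"] = df["price"] * df["quantity"]

print(df)
```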

# Rename columns of Pandas DataFrame

In [None]:
import pandas as pd
df = pd.DataFrame({'$a':[1], '$b': [10]})

#### All at once

In [None]:
df

| | \$a | \$b |
|---|---|---|
| 0 | 1 | 10 |


In [None]:
df.columns = ['first_column', 'second_column']
df

| | first_column | second_column |
|---|---|---|
| 0 | 1 | 10 |


#### Only some

In [None]:
df.rename(columns = {'first_column': 'new_name'}, inplace = True)
df

| | new_name | second_column |
|---|---|---|
| 0 | 1 | 10 |


[Source](https://stackoverflow.com/a/11346337/7595633)
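
If the new names follow a pattern (say, dropping a leading `$`), you don't have to type them all out. This is a sketch of two ways I'd assume work for that case: passing a function to `rename`, or using the string methods on `df.columns`.

```python
import pandas as pd

df = pd.DataFrame({"$a": [1], "$b": [10]})

# Option 1: rename with a function applied to every column name (returns a new DataFrame)
stripped = df.rename(columns=lambda name: name.lstrip("$"))

# Option 2: string methods on the column index (modifies df's columns)
df.columns = df.columns.str.replace("$", "", regex=False)

print(list(df.columns))
```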

# Delete columns of Pandas DataFrame

In [None]:
# columns
df.drop(columns=['B','C'])
# rows
df.drop(index=[0,1])


[Source](https://stackoverflow.com/a/18145399/7595633)
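
Note that `drop` returns a new DataFrame and leaves the original alone, so you need to assign the result. A small sketch with made-up columns `A`, `B` and `C`:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

without_columns = df.drop(columns=["B", "C"])  # drop by column name
without_rows = df.drop(index=[0, 1])           # drop by index label

# The original DataFrame is unchanged
print(df.shape)
```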

# Get row/column count of Pandas DataFrame

In [None]:
# rows
len(df.index)
# columns
len(df.columns)
# both at once (marginally slower than len(df.index))
rows_count, columns_count = df.shape


[Source](https://stackoverflow.com/a/15943975/7595633)

# Get list of column headers of Pandas DataFrame

In [None]:
# If you hate typing
list(df)
# If you hate not being explicit
list(df.columns.values)

[Source](https://stackoverflow.com/a/19483025/7595633)

# Rearrange the order of columns of Pandas DataFrame

In [None]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(1, 2))

In [None]:
# Original dataframe
df

| | 0 | 1 |
|---|---|---|
| 0 | 0.444825 | 0.867773 |


In [None]:
# Change column order
df = df[[1,0]]
df

| | 1 | 0 |
|---|---|---|
| 0 | 0.867773 | 0.444825 |


[Source](https://stackoverflow.com/a/13148611/7595633)

# Add new column to Pandas DataFrame

In [None]:
# Simple version
df['new_name'] = new_column
# Proper version recommended by Pandas
df = df.assign(new_name=new_column)

[Source](https://stackoverflow.com/a/12555510/7595633)
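
The practical difference between the two, as far as I can tell: plain assignment modifies `df` in place, while `assign` returns a new DataFrame and can compute the new column from existing ones. A minimal sketch with invented column names:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": [1, 2, 3]})

# In-place: df itself gains column "b"
df["b"] = [10, 20, 30]

# assign: returns a new DataFrame; the lambda receives the DataFrame being assigned to
with_sum = df.assign(c=lambda d: d["a"] + d["b"])

print(with_sum)
```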

# Add new row to Pandas DataFrame

Don't do it! It's slow and unidiomatic. Gather all the data first and only create the dataframe after.
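
A sketch of what "gather all the data first" can look like: collect plain dicts in a list and build the DataFrame once at the end. The loop and the field names (`id`, `square`) are stand-ins for wherever your rows actually come from.

```python
import pandas as pd

# Collect rows as plain Python dicts (cheap to append to a list)
rows = []
for i in range(3):
    rows.append({"id": i, "square": i * i})

# Build the DataFrame once, after all rows are known
df = pd.DataFrame(rows)

print(df)
```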

In [None]:
df = pd.DataFrame(np.random.rand(1, 2))

In [None]:
# If you must (DataFrame.append was removed in pandas 2.0, so use pd.concat):
new_row = {0: 'new', 1: 'row'}
pd.concat([df, pd.DataFrame([new_row])])

| | 0 | 1 |
|---|---|---|
| 0 | 0.374034 | 0.582879 |
| 0 | new | row |


[Source](https://stackoverflow.com/questions/10715965/create-Pandas-dataframe-by-appending-one-row-at-a-time/10716007#10716007)

# Drop rows whose value in a certain column is NaN in Pandas DataFrame

In [None]:
# Subset lists columns you care about
df.dropna(subset = ['column1_name', 'column2_name', 'column3_name'])

[Source](https://stackoverflow.com/q/13413590/7595633)
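
Here is a small runnable sketch showing why `subset` matters: without it, `dropna` removes a row if any column is NaN. The column names and values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "name": ["x", "y", "z"],
    "score": [1.0, np.nan, 3.0],
    "note": [np.nan, "ok", "ok"],
})

# Only rows with NaN in "score" are dropped; NaN in "note" is tolerated
cleaned = df.dropna(subset=["score"])

# Without subset, any NaN in any column drops the row
strict = df.dropna()

print(cleaned)
```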

# Change column type in Pandas DataFrame

In [None]:
# convert column "a" to int64 dtype and "b" to np.float64 type
df = df.astype({"a": int, "b": np.float64})

[Source](https://stackoverflow.com/a/28648923/7595633)
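
A quick sketch of `astype` on a toy DataFrame. One caveat I'm fairly sure of: `astype` raises if a value can't be converted, so for messy columns `pd.to_numeric` with `errors="coerce"` turns the bad values into NaN instead. The data below is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": ["1", "2"], "b": [1, 2]})
df = df.astype({"a": int, "b": np.float64})

# For columns with unparseable values, coerce them to NaN instead of raising
messy = pd.Series(["1", "oops"])
coerced = pd.to_numeric(messy, errors="coerce")

print(df.dtypes)
```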

# Delete row based on value of particular column from Pandas DataFrame

See also "Select rows based on column values from Pandas DataFrame" above for other options.

In [None]:
df = df[df.relevant_column != some_value]

[Source](https://stackoverflow.com/a/18173074/7595633)

# Save Pandas DataFrame to a CSV

In [None]:
# sep='\t' writes tab-separated values; leave it out for a comma-separated file
df.to_csv(file_name, sep='\t', encoding='utf-8', index=False)

[Source](https://stackoverflow.com/a/16923367/7595633)
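
To check what actually gets written, you can round-trip through an in-memory buffer instead of a file. This is only a sketch: the data is made up, and `io.StringIO` stands in for a real file name.

```python
import io

import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
text = buffer.getvalue()

# Read it back with the same separator
restored = pd.read_csv(io.StringIO(text), sep="\t")

print(text)
```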

# Is it count_values or values_count or what?

It's `value_counts`. If you asked this question, you might wanna try [Deepnote](https://deepnote.com), which has autocomplete and would tell you.

In [None]:
df.value_counts()

[Source](https://stackoverflow.com/a/18173074/7595633)
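
A small sketch of what you get back. On a single column (a Series), `value_counts` counts each distinct value, sorted from most to least frequent; on a whole DataFrame it counts distinct rows. The data is invented for the example.

```python
import pandas as pd

# Hypothetical data, for illustration only
letters = pd.Series(["a", "b", "a", "c", "a"])

counts = letters.value_counts()                # absolute counts, most frequent first
shares = letters.value_counts(normalize=True)  # relative frequencies instead of counts

# On a DataFrame, value_counts counts unique rows
df = pd.DataFrame({"x": [1, 1, 2], "y": ["p", "p", "q"]})
row_counts = df.value_counts()

print(counts)
```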