# Transforming DataFrames

### I. Sorting

`df.sort_values("column_name")`: sorts the rows of a df by the values in that column, in ascending order <br>
`df.sort_values("column_name", ascending = False)`: sorts in descending order (largest first)

#### a. Sorting by multiple columns:

`df.sort_values(["column_name1", "column_name2"], ascending = [True, False])`
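
A minimal sketch of the sorting calls above. The `dogs` df and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
    "weight_kg": [25, 33, 30, 20],
})

# Single column, ascending (the default)
by_weight = dogs.sort_values("weight_kg")

# Single column, largest first
by_weight_desc = dogs.sort_values("weight_kg", ascending=False)

# Breed A-Z, then heaviest first within each breed
by_breed_weight = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])
print(by_breed_weight)
```

`sort_values` returns a new df, so `dogs` itself keeps its original order.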

### II. Subsetting columns and rows

`df["column_name"]`: returns the specified column as a Series

> <span style = "color:royalblue"> to subset multiple columns pass in a list of column names: `df[["column_name1", "column_name2"]]`.  </span>

#### a. Using Comparison Operators

`df["column_name"] > 50`: returns a boolean Series that is `True` for each row where the comparison holds **for the specified column**

> <span style = "color:royalblue"> 
the comparison conditions can be used within square brackets to subset the rows we are interested in.</span>

> `df[df["column_name"] > 50]`: returns only the rows where the condition is `True`. Other comparison operators (`<`, `>=`, `<=`, `==`, `!=`) can be used as well.
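
A small sketch of the two steps above (build the boolean Series, then use it inside square brackets). The `dogs` df is hypothetical:

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "height_cm": [56, 43, 46, 59],
})

# Step 1: the comparison gives one True/False per row
is_tall = dogs["height_cm"] > 50
print(is_tall.tolist())

# Step 2: the boolean Series keeps only the rows marked True
tall_dogs = dogs[is_tall]
print(tall_dogs)
```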

#### b. Subsetting based on multiple conditions

Use the following three steps together:
1. `variable1 = df["column1"] == "some_filter_value"`
2. `variable2 = df["column2"] == "some_filter_value"`
3. `df[variable1 & variable2]`

> this returns only the rows that satisfy **both** conditions: `&` means "and" (use `|` for "or").
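
The three steps above, sketched on a made-up `dogs` df (names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
    "color": ["Brown", "Black", "Black", "Brown"],
})

is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"

# Rows where both conditions are True
brown_labs = dogs[is_lab & is_brown]

# Same filter in one line: each condition needs its own parentheses
same_result = dogs[(dogs["breed"] == "Labrador") & (dogs["color"] == "Brown")]
print(brown_labs)
```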

#### c. Subsetting using .isin()

`some_filter_var = df["column"].isin(["filter_value1", "filter_value2"])` <br>
`df[some_filter_var]`: returns the rows whose value in the column is either filter_value1 or filter_value2
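
A short sketch of `.isin()` on a hypothetical `dogs` df:

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "color": ["Brown", "Black", "Tan", "White"],
})

# True where the color is one of the listed values
is_black_or_tan = dogs["color"].isin(["Black", "Tan"])
black_or_tan_dogs = dogs[is_black_or_tan]
print(black_or_tan_dogs)
```

This is handy when one column has to be checked against several allowed values, instead of chaining `==` comparisons with `|`.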

### III. Creating a new Column

`df["new_column_name"] = df["old_column_name"] / 100`
> <span style = "color:royalblue"> here `/ 100` is an example of manipulating data to get different values to be stored in the new column </span>
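
For example, converting centimetres to metres in a new column (the `dogs` df and column names are made up):

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie"],
    "height_cm": [56, 43],
})

# The division is applied to every row; the result is stored as a new column
dogs["height_m"] = dogs["height_cm"] / 100
print(dogs)
```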

### Explicit indexes

**`df.columns`** contains an index object of column names

**`df.index`** contains an index object of row labels (row numbers by default)

### Setting a column as the index

**`df_ind = df.set_index("name")`** moves a column from the body of the df to the index. 
<br>
<span style = "color:indianred"> Note that **df_ind** (here and below) is a user defined variable to represent the df with a named index</span>

#### pre-index assignment:

![pre-index](pre-index.png)

#### post-index assignment:

![post-index](post-index.png)

### Removing an index

**`df_ind.reset_index()`**: moves the index back into the body of the df as a regular column <br>
**`df_ind.reset_index(drop = True)`**: discards the index values altogether instead of keeping them as a column
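
A minimal sketch of setting and removing an index, assuming a made-up `dogs` df with a `name` column:

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Labrador"],
})

# "name" leaves the body of the df and becomes the index
dogs_ind = dogs.set_index("name")

# "name" comes back as a regular column
restored = dogs_ind.reset_index()

# "name" is thrown away; only a default 0..n-1 index remains
dropped = dogs_ind.reset_index(drop=True)
print(dropped)
```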

### Subsetting using an index

**`df_ind.loc[["index_value1", "index_value2"]]`** returns the rows whose **index values** match the list. <br>
> <span style = "color:royalblue"> index values **do not** need to be unique. </span>
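
A sketch of `.loc` with index values, including a non-unique index. The `dogs` df is hypothetical:

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Labrador"],
})

# Unique index: one row per listed name
by_name = dogs.set_index("name")
bella_and_lucy = by_name.loc[["Bella", "Lucy"]]

# Non-unique index: every row with a matching value is returned
by_breed = dogs.set_index("breed")
labradors = by_breed.loc[["Labrador"]]
print(labradors)
```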

### Multi-level indexes

**`df_ind = df.set_index(["column1", "column2"])`** creates a hierarchical (multi-level) index: the first column becomes the outer level, the second the inner level

> <span style = "color:royalblue"> when using **.loc** on a multi-index with an outer-level value, every row under that outer value is returned, whatever its inner value. Passing a list of values returns the rows matching any of those **outer-level** values</span>

![multi-index](multi-index.png)

### Subsetting on inner levels with a list of tuples

`df_ind.loc[[("outer name1", "inner name1"), ("outer name2", "inner name2")]]`: each tuple is one (outer, inner) pair to match
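
A sketch of both kinds of multi-index subsetting, on a made-up `dogs` df indexed by `breed` (outer) and `color` (inner). The index is sorted first, which is an extra precaution in this sketch rather than something the notes require:

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Chihuahua"],
    "color": ["Brown", "Black", "Black", "Brown", "Tan"],
})

dogs_ind = dogs.set_index(["breed", "color"]).sort_index()

# List of outer values: all colors of the listed breeds
labs_and_poodles = dogs_ind.loc[["Labrador", "Poodle"]]

# List of tuples: only the exact (breed, color) pairs
pairs = dogs_ind.loc[[("Labrador", "Brown"), ("Poodle", "Black")]]
print(pairs)
```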

### Sorting by index values

`df_ind.sort_index()`: sorts by all index levels, from outer to inner, in ascending order

#### to control the sorting: 

`df_ind.sort_index(level= ["index name1", "index name2"], ascending= [True, False])`
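
A sketch of default and controlled index sorting on a hypothetical `dogs` df indexed by `breed` and `color`:

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Chihuahua"],
    "color": ["Brown", "Black", "Black", "Brown", "Tan"],
})
dogs_ind = dogs.set_index(["breed", "color"])

# Default: breed A-Z, then color A-Z within each breed
default_sorted = dogs_ind.sort_index()

# Controlled: breed A-Z, but color Z-A within each breed
custom_sorted = dogs_ind.sort_index(level=["breed", "color"], ascending=[True, False])
print(custom_sorted)
```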

### Slicing by index values (sort first)

#### By outer only:
**`df.loc["outer name1": "outer name2"]`**

#### By inner also:
**`df.loc[("outer name1", "inner name1") : ("outer name2", "inner name2")]`**

> <span style = "color:indianred"> Note that pandas will not raise an error if you try to slice by inner index values only; it just returns an empty or otherwise unexpected result, so check the output</span>

### Slicing by column 

#### Subsetting columns, while keeping all rows:
**`df.loc[:, "column1":"column2"]`**

#### Slicing on columns and rows:
**`df.loc[("outer name1", "inner name1") : ("outer name2", "inner name2"), "column1":"column2"]`**
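
A sketch of the four slices above on a made-up, sorted `dogs_srt` df (data and names are assumptions). Unlike list slicing, `.loc` slices include the end value:

```python
import pandas as pd

# Hypothetical data, invented for this example
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Chihuahua"],
    "color": ["Brown", "Black", "Black", "Brown", "Tan"],
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "height_cm": [56, 43, 46, 59, 18],
    "weight_kg": [25, 23, 30, 20, 2],
})

# Sort the index first, otherwise slicing is unreliable
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# Outer level only: Chihuahua through Labrador, both included
outer_slice = dogs_srt.loc["Chihuahua":"Labrador"]

# Outer and inner levels: pass (outer, inner) tuples as the endpoints
inner_slice = dogs_srt.loc[("Labrador", "Brown"):("Poodle", "Black")]

# Columns only: keep all rows
col_slice = dogs_srt.loc[:, "name":"height_cm"]

# Rows and columns together
both = dogs_srt.loc[("Labrador", "Brown"):("Poodle", "Black"), "name":"height_cm"]
print(both)
```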


<br>
<br>
<br>
<br>
<br> 

# Missing Values

`df.isna()`: returns a df of booleans, `True` wherever a value is missing
<br>

`df.isna().any()`: returns one boolean per column, `True` if that column contains at least one missing value
<br>

`df.isna().sum()`: counts the number of missing values in each column

> <span style = "color:royalblue"> you can plot the sum of missing values for a clear review of missing content  </span>
<br>

`df.dropna()`: removes every row that contains at least one missing value
<br>

`df.fillna(0)`: replaces the missing values with 0, so that the rest of each row is kept.
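
A sketch of the missing-value calls above on a small, made-up `dogs` df with two gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this example; np.nan marks a missing value
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "weight_kg": [25.0, np.nan, 30.0],
    "height_cm": [56.0, 43.0, np.nan],
})

missing_mask = dogs.isna()         # True/False for every cell
has_missing = dogs.isna().any()    # one True/False per column
missing_counts = dogs.isna().sum() # number of missing values per column

complete_rows = dogs.dropna()      # only rows with no missing values
filled = dogs.fillna(0)            # missing values replaced by 0
print(missing_counts)
```

Both `dropna()` and `fillna()` return a new df here, so `dogs` itself still contains the missing values.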

# Creating DataFrames


In [2]:
my_dict = {
    "key1": 1,
    "key2": 2,
    "key3": 3
}

In [4]:
my_dict["key1"] # access a value via its key

1

#### A. From a list of dictionaries

- constructed row by row

#### B. From a dictionary of lists

- constructed column by column

In [5]:
list_of_dicts = [
    {"key1": 1,"key2": 2,"key3": 3},
    {"key1": 4, "key2": 5, "key3": 6}
]

In [7]:
import pandas as pd
pd.DataFrame(list_of_dicts)

|   | key1 | key2 | key3 |
|---|------|------|------|
| 0 | 1    | 2    | 3    |
| 1 | 4    | 5    | 6    |


#### Dict of lists

- **key** = column name
- **value** = list of column values

In [8]:
dict_of_lists = {
    "key1": [1,2],
    "key2": [3,4],
    "key3": [5,6]
}

In [9]:
pd.DataFrame(dict_of_lists)

|   | key1 | key2 | key3 |
|---|------|------|------|
| 0 | 1    | 3    | 5    |
| 1 | 2    | 4    | 6    |


# Data Merging Basics

#### Inner Join

`some_variable = df1.merge(df2, on= "column")`: merges df1 with df2 on the specified column, which must be present in both dfs. Only rows whose value in that column appears in both dfs are kept.

#### Suffixes

`some_variable = df1.merge(df2, on= "column", suffixes=("_df1_short", "_df2_short"))`: adds a suffix to column names that appear in both dfs, so that we know which df each column came from
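
A sketch of an inner join with suffixes. The `owners` and `dogs` dfs are made up; both have a `name` column so the suffixes have something to do:

```python
import pandas as pd

# Hypothetical tables, invented for this example
owners = pd.DataFrame({"dog_id": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
dogs = pd.DataFrame({"dog_id": [2, 3, 4], "name": ["Charlie", "Lucy", "Max"]})

# Inner join: only dog_id values found in both tables survive (2 and 3)
merged = owners.merge(dogs, on="dog_id", suffixes=("_owner", "_dog"))
print(merged)
```

Without `suffixes`, pandas falls back to `_x` and `_y` for the overlapping `name` columns.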

### Kinds of Relationships between tables

#### One-to-one

> every row in the left table is related to **one and only one** row in the right table

#### One-to-many

> every row in the left table is related to **one or more** rows in the right table. As a result, values from the left table will be repeated to match those in the right table.
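
A sketch of a one-to-many merge, using hypothetical `owners` (left) and `dogs` (right) tables where one owner has two dogs:

```python
import pandas as pd

# Hypothetical tables, invented for this example
owners = pd.DataFrame({"owner_id": [1, 2], "owner": ["Ann", "Ben"]})
dogs = pd.DataFrame({
    "owner_id": [1, 1, 2],
    "dog": ["Bella", "Lucy", "Charlie"],
})

# Ann's row is repeated once for each of her dogs
merged = owners.merge(dogs, on="owner_id")
print(merged)
```

The merge call is the same as for a one-to-one relationship; the result simply has more rows than the left table.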

### Merging multiple DataFrames

#### Single merge with multiple columns

`df1.merge(df2, on = ['col1', 'col2'])`

#### Multiple DataFrames

`df1.merge(df2, on = ['col1', 'col2']) \
.merge(df3, on='col3', suffixes=('_col1&2_short1', '_col3_short'))`
- the `\` at the end of the first line is a line continuation, so Python reads the two lines as a single statement

#### Even more tables

`df1.merge(df2, on="col")\
   .merge(df3, on="col")\
   .merge(df4, on="col")`
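
A sketch of chaining merges across three made-up tables. Wrapping the chain in parentheses is shown as an alternative to `\`; both give the same result:

```python
import pandas as pd

# Hypothetical tables, invented for this example
owners = pd.DataFrame({"owner_id": [1, 2], "owner": ["Ann", "Ben"]})
dogs = pd.DataFrame({
    "dog_id": [10, 11, 12],
    "owner_id": [1, 1, 2],
    "dog": ["Bella", "Lucy", "Charlie"],
})
visits = pd.DataFrame({
    "dog_id": [10, 12, 12],
    "reason": ["checkup", "vaccine", "checkup"],
})

# Backslash continuation, as in the notes above
with_backslash = owners.merge(dogs, on="owner_id") \
    .merge(visits, on="dog_id")

# Same chain inside parentheses, no backslashes needed
with_parens = (
    owners.merge(dogs, on="owner_id")
          .merge(visits, on="dog_id")
)
print(with_parens)
```

Lucy has no visits, so the inner joins drop her row from the final result.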

### Left Join

- Returns all rows from the left table, and only those rows from the right table where the key columns match. Requested with `how="left"`, e.g. `df1.merge(df2, on="column", how="left")`.

![left-join](left-join.png)
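
A sketch of a left join on two hypothetical tables, where one dog has no matching row on the right:

```python
import pandas as pd

# Hypothetical tables, invented for this example
dogs = pd.DataFrame({"dog_id": [1, 2, 3], "dog": ["Bella", "Charlie", "Lucy"]})
visits = pd.DataFrame({"dog_id": [1, 3], "reason": ["checkup", "vaccine"]})

# Every dog is kept; dogs without a visit get a missing value in "reason"
left_joined = dogs.merge(visits, on="dog_id", how="left")
print(left_joined)
```

Counting the missing values in a right-table column (for example `left_joined["reason"].isna().sum()`) is a quick way to see how many left rows found no match.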