# Data Manipulation with Pandas
- **Methods**
    - df.head()
    - df.tail()
    - df.info()
    - df.describe()
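    - `.size()` counts the rows in each group when called on a grouped object, e.g. `df.groupby("column").size()`. On a plain DataFrame, `df.size` is an attribute that returns the total number of elements.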
- **Attributes**
    - df.values
    - df.columns
    - df.index
    
#### Sorting
- df.sort_values("column_name", ascending = $\color{red}{\text{False}}$)
- df.sort_values(["column_1","column_2"])
- df.sort_values(["column_1","column_2"], ascending = [$\color{red}{\text{True}}$, $\color{red}{\text{False}}$])
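
A minimal sketch of the sorting calls above, assuming a small `dogs` DataFrame rebuilt from the sample rows used in the slicing cells of these notes. The names `by_weight` and `by_breed_height` are illustrative only.

```python
import pandas as pd

# Sample rows taken from the dogs data used later in these notes
dogs = pd.DataFrame({
    "name": ["Stella", "Lucy", "Max", "Bella", "Charlie", "Cooper", "Bernie"],
    "breed": ["Chihuahua", "Chow Chow", "Labrador", "Labrador",
              "Poodle", "Schnauzer", "St. Bernard"],
    "height_cm": [18, 46, 59, 56, 43, 49, 77],
    "weight_kg": [2, 22, 29, 25, 23, 17, 74],
})

# Single column, heaviest dogs first
by_weight = dogs.sort_values("weight_kg", ascending=False)

# Two columns: breed A-Z, then tallest first within the same breed
by_breed_height = dogs.sort_values(["breed", "height_cm"],
                                   ascending=[True, False])

print(by_weight["name"].tolist())
print(by_breed_height["name"].tolist())
```

`sort_values` returns a new DataFrame, so `dogs` itself keeps its original row order.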

#### Subsetting columns
- df["column_name"]
- df[["column_one", "column_two"]]
    - outer bracket: subsetting the dataframe
    - inner bracket: list of column names to subset
    - eg: col_to_subset = ["column_one", "column_two"] ----> df[col_to_subset]
    
#### Subsetting rows
- df["column"] > 50      ----> returns a boolean Series
- df[df["column"] > 50]  ----> returns dataframe 
- df[df["column"] == "text_value"]
- df[df["date"] > "date_value"]

#### Subsetting based on multiple conditions
- is_lab = dogs["breed"] == "Labrador"
- is_brown = dogs["color"] == "Brown"
- dogs[is_lab & is_brown]
- dogs[(dogs["breed"] == "Labrador") & (dogs["color"] == "Brown")]

#### Subsetting using `.isin()`
- is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
- dogs[is_black_or_brown]
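
The row-subsetting patterns above can be sketched as follows. This assumes the same sample dogs rows used in the slicing cells of these notes. The names `brown_labs` and `black_or_brown` are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Stella", "Lucy", "Max", "Bella", "Charlie", "Cooper", "Bernie"],
    "breed": ["Chihuahua", "Chow Chow", "Labrador", "Labrador",
              "Poodle", "Schnauzer", "St. Bernard"],
    "color": ["Tan", "Brown", "Black", "Brown", "Black", "Grey", "White"],
})

# Each comparison yields a boolean Series aligned with the rows
is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"

# Combine masks with & (parentheses are needed when written inline)
brown_labs = dogs[is_lab & is_brown]

# .isin() tests membership in a list of allowed values
black_or_brown = dogs[dogs["color"].isin(["Black", "Brown"])]

print(brown_labs["name"].tolist())
print(black_or_brown["name"].tolist())
```

Note that `&` is the element-wise "and" for boolean Series. The Python keyword `and` does not work here.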

#### New Column
- Adding a new column
    - dogs["height_m"] = dogs["height_cm"] / 100
        - BMI = weight_in_kg / (height_in_m$)^{2}$
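
As a rough sketch of adding derived columns, the snippet below converts height to metres and then computes BMI as weight in kg divided by the square of height in metres. The column name `bmi` is an assumption made for illustration.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Stella", "Lucy", "Max", "Bella"],
    "height_cm": [18, 46, 59, 56],
    "weight_kg": [2, 22, 29, 25],
})

# New column computed element-wise from an existing column
dogs["height_m"] = dogs["height_cm"] / 100

# BMI uses the height in metres, squared
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(dogs[["name", "height_m", "bmi"]])
```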

#### Summary Statistics
- df["column"].mean()
    - `.median()`,`.mode()`
    - `.min()`,`.max()`
    - `.var()`,`.std()`
    - `.sum()`
    - `.quantile()`
    
#### The `.agg()` method
$\color{blue}{\text{def}}$ $\color{red}{\text{pct30}}$ (column):
<br>&emsp;&emsp; $\color{blue}{\text{return}}$ column.quantile($\color{red}{\text{0.3}}$) 
<br>
<br>dogs["weight_kg"].agg(pct30)
<br>dogs[["weight_kg", "height_cm"]].agg(pct30)

#### Multiple summaries
$\color{blue}{\text{def}}$ $\color{red}{\text{pct40}}$ (column):
<br>&emsp;&emsp; $\color{blue}{\text{return}}$ column.quantile($\color{red}{\text{0.4}}$)
<br>
<br>dogs["weight_kg"].agg([pct30, pct40])
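
A runnable sketch of the `.agg()` calls above, using the weights and heights from the sample dogs rows found later in these notes. The variable names `one_stat`, `per_column` and `both` are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({
    "height_cm": [18, 46, 59, 56, 43, 49, 77],
    "weight_kg": [2, 22, 29, 25, 23, 17, 74],
})

def pct30(column):
    # 30th percentile of a column
    return column.quantile(0.3)

def pct40(column):
    # 40th percentile of a column
    return column.quantile(0.4)

# One custom summary on one column: a single number
one_stat = dogs["weight_kg"].agg(pct30)

# The same summary on several columns: one value per column
per_column = dogs[["weight_kg", "height_cm"]].agg(pct30)

# Several summaries at once: one value per function, labelled by its name
both = dogs["weight_kg"].agg([pct30, pct40])

print(one_stat)
print(per_column)
print(both)
```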

#### Cumulative sum
dogs["weight_kg"].cumsum()
<br>
- `.cummax()`
- `.cummin()`
- `.cumprod()`
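
A short sketch of the cumulative methods, assuming the weights from the sample dogs rows used later in these notes. Each method returns a Series of the same length as its input.

```python
import pandas as pd

weights = pd.Series([2, 22, 29, 25, 23, 17, 74], name="weight_kg")

# Running total: each entry is the sum of all values up to that row
running_total = weights.cumsum()

# Running maximum: the largest value seen so far
running_max = weights.cummax()

print(running_total.tolist())
print(running_max.tolist())
```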

#### Counting
- Dropping duplicate name
    - eg: df.drop_duplicates(subset="column_name_to_drop_duplicates")
    - eg: df.drop_duplicates(subset=["column_one", "column_two"])
- df["column"].value_counts()
- df["column"].value_counts(sort=$\color{red}{\text{True}}$)
- df["column"].value_counts(normalize=$\color{red}{\text{True}}$)
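
The counting calls above can be sketched like this, again assuming the sample dogs rows from later in these notes, where Labrador appears twice. The names `unique_breeds`, `counts` and `proportions` are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Stella", "Lucy", "Max", "Bella", "Charlie", "Cooper", "Bernie"],
    "breed": ["Chihuahua", "Chow Chow", "Labrador", "Labrador",
              "Poodle", "Schnauzer", "St. Bernard"],
})

# Keep only the first row for each breed
unique_breeds = dogs.drop_duplicates(subset="breed")

# Count rows per breed, most frequent first
counts = dogs["breed"].value_counts(sort=True)

# Same counts expressed as proportions of the total
proportions = dogs["breed"].value_counts(normalize=True)

print(counts)
print(proportions)
```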

#### Grouped summary statistics
- Grouped summaries
- eg: dogs.groupby("color")["weight_kg"].mean()
- eg: dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum"])
- eg: dogs.groupby(["color","breed"])["weight_kg"].mean()
- eg: dogs.groupby(["color","breed"])[["weight_kg", "height_cm"]].mean()
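
A minimal sketch of grouped summaries, assuming the sample dogs rows used later in these notes. Passing the aggregation names as strings is one option that works across pandas versions. The result names are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({
    "breed": ["Chihuahua", "Chow Chow", "Labrador", "Labrador",
              "Poodle", "Schnauzer", "St. Bernard"],
    "color": ["Tan", "Brown", "Black", "Brown", "Black", "Grey", "White"],
    "weight_kg": [2, 22, 29, 25, 23, 17, 74],
})

# One statistic per group
mean_by_color = dogs.groupby("color")["weight_kg"].mean()

# Several statistics per group: one column per statistic
stats_by_color = dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum"])

# Grouping by two columns gives a two-level index
mean_by_color_breed = dogs.groupby(["color", "breed"])["weight_kg"].mean()

print(mean_by_color)
print(stats_by_color)
```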

#### Pivot tables
- Pivot tables are another way of calculating grouped summary statistics. 
- We grouped the dogs by color and calculated their mean weights. We can do the same thing using the pivot_table method. The "values" argument is the column that you want to summarize, and the index column is the column that you want to group by. By default, pivot_table takes the mean value for each group.
    - dogs.groupby($\color{red}{\text{"color"}}$)[$\color{red}{\text{"weight_kg"}}$].mean()
    - dogs.pivot_table(values=$\color{red}{\text{"weight_kg"}}$, index=$\color{red}{\text{"color"}}$)
    
#### Different statistics
$\color{blue}{\text{import}}$ numpy $\color{blue}{\text{as}}$ np
<br>dogs.pivot_table(values=$\color{red}{\text{"weight_kg"}}$, index=$\color{red}{\text{"color"}}$, aggfunc=np.median)

#### Multiple statistics
dogs.pivot_table(values=$\color{red}{\text{"weight_kg"}}$, index=$\color{red}{\text{"color"}}$, aggfunc=[np.mean, np.median])

#### Pivot on two variables
- dogs.groupby([$\color{red}{\text{"color"}}$, $\color{red}{\text{"breed"}}$])[$\color{red}{\text{"weight_kg"}}$].mean()
- dogs.pivot_table(values=$\color{red}{\text{"weight_kg"}}$, index=$\color{red}{\text{"color"}}$, columns=$\color{red}{\text{"breed"}}$)

#### Filling missing values in pivot tables
- dogs.pivot_table(values=$\color{red}{\text{"weight_kg"}}$, index=$\color{red}{\text{"color"}}$, columns=$\color{red}{\text{"breed"}}$, fill_value=$\color{red}{\text{0}}$)

#### Summing with pivot tables
- dogs.pivot_table(values=$\color{red}{\text{"weight_kg"}}$, index=$\color{red}{\text{"color"}}$, columns=$\color{red}{\text{"breed"}}$, fill_value=$\color{red}{\text{0}}$, margins=$\color{red}{\text{True}}$)
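
The pivot table options above can be sketched together as follows, assuming the sample dogs rows from later in these notes. With `margins=True`, pandas adds an `All` row and column holding the statistic (here the default mean) computed over the underlying data.

```python
import pandas as pd

dogs = pd.DataFrame({
    "breed": ["Chihuahua", "Chow Chow", "Labrador", "Labrador",
              "Poodle", "Schnauzer", "St. Bernard"],
    "color": ["Tan", "Brown", "Black", "Brown", "Black", "Grey", "White"],
    "weight_kg": [2, 22, 29, 25, 23, 17, 74],
})

# Same numbers as dogs.groupby("color")["weight_kg"].mean()
by_color = dogs.pivot_table(values="weight_kg", index="color")
grouped = dogs.groupby("color")["weight_kg"].mean()

# Two variables: colors in rows, breeds in columns.
# Missing color/breed combinations are filled with 0.
wide = dogs.pivot_table(values="weight_kg", index="color", columns="breed",
                        fill_value=0, margins=True)

print(by_color)
print(wide)
```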

#### Explicit indexes
- **Setting a column as the index**
    - You can move a column from the body of the DataFrame to the index. This is called "setting an index," and it uses the set_index method. *(dogs_ind = dogs.set_index("name"))*
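    - To undo what you just did, you can reset the index, which moves it back into a regular column. *(dogs_ind.reset_index())* To discard the index values entirely instead, pass the drop argument. *(dogs_ind.reset_index(drop=$\color{red}{\text{True}}$))*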
    
- The values in the index don't need to be unique. Here, there are two Labradors in the index.
- DataFrames have a subsetting method called "loc," which filters on index values. 
- You can include multiple columns in the index by passing a list of column names to set_index.
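
A small sketch of setting, using and resetting an index, assuming three of the sample dogs rows from later in these notes. Variable names other than `dogs_ind` are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Stella", "Max", "Bella"],
    "breed": ["Chihuahua", "Labrador", "Labrador"],
    "weight_kg": [2, 29, 25],
})

# Move the name column into the index
dogs_ind = dogs.set_index("name")

# .loc filters on index values
picked = dogs_ind.loc[["Bella", "Stella"]]

# reset_index() restores the column; drop=True throws the index away
restored = dogs_ind.reset_index()
dropped = dogs_ind.reset_index(drop=True)

# Index values do not have to be unique
by_breed = dogs.set_index("breed")
labs = by_breed.loc["Labrador"]

print(picked)
print(labs)
```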

#### Multi-level indexes a.k.a. hierarchical indexes
- dogs_ind3 = dogs.set_index([$\color{red}{\text{"breed"}}$,$\color{red}{\text{"color"}}$])
- To take a subset of rows at the outer level index, you pass a list of index values to loc.
    - dogs_ind3.loc[[$\color{red}{\text{"Labrador"}}$, $\color{red}{\text{"Chihuahua"}}$]]
- To subset on inner levels, you need to pass a list of tuples. The resulting rows have to match all conditions from a tuple.
    - dogs_ind3.loc[[($\color{red}{\text{"Labrador"}}$, $\color{red}{\text{"Brown"}}$), ($\color{red}{\text{"Chihuahua"}}$, $\color{red}{\text{"Tan"}}$)]]
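
A hedged sketch of subsetting a two-level index, assuming the sample dogs rows from later in these notes. The names `outer` and `inner` are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Stella", "Lucy", "Max", "Bella", "Charlie", "Cooper", "Bernie"],
    "breed": ["Chihuahua", "Chow Chow", "Labrador", "Labrador",
              "Poodle", "Schnauzer", "St. Bernard"],
    "color": ["Tan", "Brown", "Black", "Brown", "Black", "Grey", "White"],
})

dogs_ind3 = dogs.set_index(["breed", "color"])

# Outer level: a list of breed values selects every color of those breeds
outer = dogs_ind3.loc[["Labrador", "Chihuahua"]]

# Inner level: each tuple must match both breed and color
inner = dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]

print(outer)
print(inner)
```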
    
#### Sorting by index values
- By default, it sorts all index levels from outer to inner, in ascending order.
    - dogs_ind3.sort_index()

#### Controlling sort_index
- You can control the sorting by passing lists to the level and ascending arguments.
    - dogs_ind3.sort_index(level=[$\color{red}{\text{"color"}}$, $\color{red}{\text{"breed"}}$], ascending=[$\color{red}{\text{True}}$, $\color{red}{\text{False}}$])
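
To illustrate both sorting calls, here is a sketch assuming the sample dogs rows from later in these notes. The names `default_sorted` and `custom_sorted` are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Stella", "Lucy", "Max", "Bella", "Charlie", "Cooper", "Bernie"],
    "breed": ["Chihuahua", "Chow Chow", "Labrador", "Labrador",
              "Poodle", "Schnauzer", "St. Bernard"],
    "color": ["Tan", "Brown", "Black", "Brown", "Black", "Grey", "White"],
})
dogs_ind3 = dogs.set_index(["breed", "color"])

# Default: breed ascending, then color ascending
default_sorted = dogs_ind3.sort_index()

# Custom: color ascending first, then breed descending within each color
custom_sorted = dogs_ind3.sort_index(level=["color", "breed"],
                                     ascending=[True, False])

print(default_sorted["name"].tolist())
print(custom_sorted["name"].tolist())
```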

#### Now you have two problems
- Index values are just data
- Indexes violate "tidy data" principles
    - Indexes are controversial. Although they simplify subsetting code, there are some downsides. Index values are just data. Storing data in multiple forms makes it harder to think about. There is a concept called "tidy data," where data is stored in tabular form - like a DataFrame. Each row contains a single observation, and each variable is stored in its own column. Indexes violate the last rule since index values don't get their own column. In pandas, the syntax for working with indexes is different from the syntax for working with columns. By using two syntaxes, your code is more complicated, which can result in more bugs.
    
### Slicing lists
- a technique for selecting consecutive elements from objects.
    - To slice a list, you pass the first and last positions, separated by a colon, inside square brackets. Remember that Python positions start from zero, so in `breeds[2:5]` position 2 refers to the third element, Chow Chow. Also remember that the last position, 5, is not included in the slice.
    - You can also slice DataFrames, but first, you need to sort the index.
        - To slice rows at the outer level of an index, you call loc, passing the first and last values separated by a colon. 
    - The correct approach to slicing at inner index levels is to pass the first and last positions as tuples. 
    - Since DataFrames are two-dimensional objects, you can also slice columns. You do this by passing two arguments to loc. The simplest case involves subsetting columns but keeping all rows. To do this, pass a colon as the first argument to loc. As with slicing lists, a colon by itself means "keep everything." The second argument takes column names as the first and last positions to slice on.
    - You can slice on rows and columns at the same time: simply pass the appropriate slice to each argument. 
    - An important use case of slicing is to subset DataFrames by a range of dates. 
        - You slice dates with the same syntax as other types. The first and last dates are passed as strings.
        - One helpful feature is that you can slice by partial dates. Here, the first and last positions are only specified as 2014 and 2016, with no month or day parts. pandas interprets this as slicing from the start of 2014 to the end of 2016; that is, all dates in 2014, 2015, and 2016.
        - You can also slice DataFrames by row or column number using iloc. This uses a similar syntax to slicing lists, except that there are two arguments: one for rows and one for columns. Notice that, like list slicing but unlike loc, the final values aren't included in the slice. In `df.iloc[2:5, 1:4]`, the row at position 5 and the column at position 4 aren't included.


In [1]:
breeds = ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador","Chihuahua","St. Bernard"]
print(breeds)

['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard']


In [2]:
# Slicing lists
print(breeds[2:5])
print(breeds[:3])
print(breeds[:])

['Chow Chow', 'Schnauzer', 'Labrador']
['Labrador', 'Poodle', 'Chow Chow']
['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard']


In [3]:
import pandas as pd
# Sample data: one row per dog
data = [['Chihuahua', 'Tan', 'Stella',18, 2, '2015-04-20'],['Chow Chow', 'Brown', 'Lucy',46, 22, '2014-08-25'], 
        ['Labrador', 'Black', 'Max',59, 29, '2017-01-20'], ['Labrador', 'Brown', 'Bella',56, 25, '2013-07-01'],
        ['Poodle', 'Black', 'Charlie',43, 23, '2016-09-16'],['Schnauzer', 'Grey', 'Cooper',49, 17, '2011-12-11'],
        ['St. Bernard', 'White', 'Bernie',77, 74, '2018-02-27']]
# Build the DataFrame and name its columns
df = pd.DataFrame(data, columns = ['breed', 'color', 'name', 'height_cm', 'weight_kg', 'date_of_birth'])

In [4]:
# Sort index before you slice
df_sort = df.set_index(['breed', 'color']).sort_index()
df_sort

| breed | color | name | height_cm | weight_kg | date_of_birth |
|---|---|---|---|---|---|
| Chihuahua | Tan | Stella | 18 | 2 | 2015-04-20 |
| Chow Chow | Brown | Lucy | 46 | 22 | 2014-08-25 |
| Labrador | Black | Max | 59 | 29 | 2017-01-20 |
| Labrador | Brown | Bella | 56 | 25 | 2013-07-01 |
| Poodle | Black | Charlie | 43 | 23 | 2016-09-16 |
| Schnauzer | Grey | Cooper | 49 | 17 | 2011-12-11 |
| St. Bernard | White | Bernie | 77 | 74 | 2018-02-27 |


In [5]:
# Slicing the outer index level
df_sort.loc["Chow Chow":"Poodle"]

| breed | color | name | height_cm | weight_kg | date_of_birth |
|---|---|---|---|---|---|
| Chow Chow | Brown | Lucy | 46 | 22 | 2014-08-25 |
| Labrador | Black | Max | 59 | 29 | 2017-01-20 |
| Labrador | Brown | Bella | 56 | 25 | 2013-07-01 |
| Poodle | Black | Charlie | 43 | 23 | 2016-09-16 |


In [6]:
# Slicing the inner index levels correctly
df_sort.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]

| breed | color | name | height_cm | weight_kg | date_of_birth |
|---|---|---|---|---|---|
| Labrador | Brown | Bella | 56 | 25 | 2013-07-01 |
| Poodle | Black | Charlie | 43 | 23 | 2016-09-16 |
| Schnauzer | Grey | Cooper | 49 | 17 | 2011-12-11 |


In [7]:
# Slicing columns
df_sort.loc[:, "name":"height_cm"]

| breed | color | name | height_cm |
|---|---|---|---|
| Chihuahua | Tan | Stella | 18 |
| Chow Chow | Brown | Lucy | 46 |
| Labrador | Black | Max | 59 |
| Labrador | Brown | Bella | 56 |
| Poodle | Black | Charlie | 43 |
| Schnauzer | Grey | Cooper | 49 |
| St. Bernard | White | Bernie | 77 |


In [8]:
# Slice rows and columns at the same time
df_sort.loc[("Labrador", "Brown"):("Schnauzer", "Grey"), "name":"height_cm"]

| breed | color | name | height_cm |
|---|---|---|---|
| Labrador | Brown | Bella | 56 |
| Poodle | Black | Charlie | 43 |
| Schnauzer | Grey | Cooper | 49 |


In [9]:
# Dog days: set date_of_birth as the index and sort by it
df = df.set_index("date_of_birth").sort_index()
df

| date_of_birth | breed | color | name | height_cm | weight_kg |
|---|---|---|---|---|---|
| 2011-12-11 | Schnauzer | Grey | Cooper | 49 | 17 |
| 2013-07-01 | Labrador | Brown | Bella | 56 | 25 |
| 2014-08-25 | Chow Chow | Brown | Lucy | 46 | 22 |
| 2015-04-20 | Chihuahua | Tan | Stella | 18 | 2 |
| 2016-09-16 | Poodle | Black | Charlie | 43 | 23 |
| 2017-01-20 | Labrador | Black | Max | 59 | 29 |
| 2018-02-27 | St. Bernard | White | Bernie | 77 | 74 |


In [10]:
# Slicing by dates
# Get dogs with date_of_birth between 2014-08-25 and 2016-09-16
df.loc["2014-08-25":"2016-09-16"]

| date_of_birth | breed | color | name | height_cm | weight_kg |
|---|---|---|---|---|---|
| 2014-08-25 | Chow Chow | Brown | Lucy | 46 | 22 |
| 2015-04-20 | Chihuahua | Tan | Stella | 18 | 2 |
| 2016-09-16 | Poodle | Black | Charlie | 43 | 23 |


In [11]:
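# Slicing by partial dates
# Get dogs with date_of_birth between 2014-01-01 and 2016-12-31
# Partial-date slicing needs a DatetimeIndex, so convert the string index first
df_dates = df.copy()
df_dates.index = pd.to_datetime(df_dates.index)
df_dates.loc["2014":"2016"]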

| date_of_birth | breed | color | name | height_cm | weight_kg |
|---|---|---|---|---|---|
| 2014-08-25 | Chow Chow | Brown | Lucy | 46 | 22 |
| 2015-04-20 | Chihuahua | Tan | Stella | 18 | 2 |
| 2016-09-16 | Poodle | Black | Charlie | 43 | 23 |


In [12]:
# Subsetting by row/column number
print(df, '\n', '\n', df.iloc[2:5, 1:4])
# print('\n')
# print(df.iloc[2:5, 1:4])

                     breed  color     name  height_cm  weight_kg
date_of_birth                                                   
2011-12-11       Schnauzer   Grey   Cooper         49         17
2013-07-01        Labrador  Brown    Bella         56         25
2014-08-25       Chow Chow  Brown     Lucy         46         22
2015-04-20       Chihuahua    Tan   Stella         18          2
2016-09-16          Poodle  Black  Charlie         43         23
2017-01-20        Labrador  Black      Max         59         29
2018-02-27     St. Bernard  White   Bernie         77         74 
 
                color     name  height_cm
date_of_birth                           
2014-08-25     Brown     Lucy         46
2015-04-20       Tan   Stella         18
2016-09-16     Black  Charlie         43


In [13]:
df.groupby("breed")["color"].size()

breed
Chihuahua      1
Chow Chow      1
Labrador       2
Poodle         1
Schnauzer      1
St. Bernard    1
Name: color, dtype: int64

### Working with pivot tables
- Create a pivot table by calling `.pivot_table()`. The first argument is the name of the column containing the values to aggregate. The index argument lists the columns to group by and display in rows, and the columns argument lists the columns to group by and display in columns.

#### .loc[] + slicing is a power combo
- Pivot tables are just DataFrames with sorted indexes. In particular, the loc and slicing combination is ideal for subsetting pivot tables, like so.
    - df.loc["Chow Chow":"Poodle"]

#### The axis argument
- The methods for calculating summary statistics on a DataFrame, such as mean, have an axis argument. The default value is "index", which means "calculate the statistic across rows", giving one result per column.
    - df.mean(axis="index")

#### Calculating summary stats across columns
- To calculate a summary statistic for each row, that is, "across the columns," you set axis to "columns." For most DataFrames, setting the axis argument doesn't make any sense, since you'll have different data types in each column. Pivot tables are a special case since every column contains the same data type.
    - df.mean(axis="columns")
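
A small sketch of the two axis settings. The `temps` table and its numbers are hypothetical and only stand in for a pivot table whose columns all hold the same kind of value.

```python
import pandas as pd

# Hypothetical pivot-style table: cities in rows, years in columns
temps = pd.DataFrame(
    {2019: [10.0, 20.0], 2020: [14.0, 22.0]},
    index=["Cairo", "Delhi"],
)

# axis="index" (the default): move down the rows, one result per column
mean_by_year = temps.mean(axis="index")

# axis="columns": move across the columns, one result per row
mean_by_city = temps.mean(axis="columns")

print(mean_by_year)
print(mean_by_city)
```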

#### Examples:
###### # Add a year column to temperatures
temperatures['year'] = temperatures.date.dt.year

###### # Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table("avg_temp_c", index=["country", "city"], columns="year")

###### # See the result
print(temp_by_country_city_vs_year)

###### # Subset for Egypt to India
temp_by_country_city_vs_year.loc["Egypt":"India"]

###### # Subset for Egypt, Cairo to India, Delhi
temp_by_country_city_vs_year.loc[("Egypt","Cairo"):("India","Delhi")]

###### # Subset in both directions at once
temp_by_country_city_vs_year.loc[("Egypt","Cairo"):("India","Delhi"), 2005:2010]

###### # Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

###### # Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

###### # Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis='columns')

###### # Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])

#### Visualizing your data
$\color{blue}{\text{import}}$ matplotlib.pyplot $\color{blue}{\text{as}}$ plt
<br>
<br> df[$\color{red}{\text{"column_name"}}$].hist(bins=$\color{red}{\text{5}}$)
<br> plt.show()

##### Bar plots
df_groupby = df.groupby($\color{red}{\text{"breed"}}$)[$\color{red}{\text{"weight_kg"}}$].mean()
<br> df_groupby.plot(kind=$\color{red}{\text{"bar"}}$, title=$\color{red}{\text{"Mean Weight by Dog Breed"}}$)
<br> plt.show()

##### Line plots
df.plot(x=$\color{red}{\text{"date"}}$, y=$\color{red}{\text{"weight_kg"}}$, kind=$\color{red}{\text{"line"}}$)
<br> plt.show()

##### Rotating axis labels
df.plot(x=$\color{red}{\text{"date"}}$, y=$\color{red}{\text{"weight_kg"}}$, kind=$\color{red}{\text{"line"}}$, rot=$\color{red}{\text{45}}$)
<br> plt.show()

##### Scatter plots
df.plot(x=$\color{red}{\text{"date"}}$, y=$\color{red}{\text{"weight_kg"}}$, kind=$\color{red}{\text{"scatter"}}$)
<br> plt.show()

##### Layering plots
- alpha argument: 0 means completely transparent (that is, invisible), and 1 means completely opaque.
<br> df[df[$\color{red}{\text{"sex"}}$] == $\color{red}{\text{"F"}}$][$\color{red}{\text{"height_cm"}}$].hist(alpha=$\color{red}{\text{0.7}}$)
<br> df[df[$\color{red}{\text{"sex"}}$] == $\color{red}{\text{"M"}}$][$\color{red}{\text{"height_cm"}}$].hist(alpha=$\color{red}{\text{0.7}}$)
<br> plt.legend([$\color{red}{\text{"F"}}$, $\color{red}{\text{"M"}}$])
<br> plt.show()

#### Missing values
-  Detecting missing values
    - When you first get a DataFrame, it's a good idea to get a sense of whether it contains any missing values, and if so, how many. That's where the isna method comes in. Calling isna on a DataFrame returns a Boolean for every single value, indicating whether it is missing, but this isn't very helpful when you're working with a lot of data. `.isna()`
- Detecting any missing values
    - If we chain `.isna()` with `.any()`, we get one value for each column that tells us whether it contains any missing values. `.isna().any()`
- Counting missing values
    - Since taking the sum of Booleans is the same thing as counting the number of Trues, we can combine sum with isna to count the number of NaNs in each column. `.isna().sum()`
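
The three detection steps above can be sketched as follows. The missing entries in this small `dogs` table are hypothetical, added only so there is something to detect.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing measurements
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "weight_kg": [25.0, np.nan, 22.0],
    "height_cm": [56.0, 43.0, np.nan],
})

# One Boolean per cell
cell_flags = dogs.isna()

# One Boolean per column: does it contain any missing value?
has_missing = dogs.isna().any()

# One count per column: True counts as 1 when summed
n_missing = dogs.isna().sum()

print(has_missing)
print(n_missing)
```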
    
##### Plotting missing values
$\color{blue}{\text{import}}$ matplotlib.pyplot $\color{blue}{\text{as}}$ plt
<br>
<br> df.isna().$\color{green}{\text{sum}}$().plot(kind=$\color{red}{\text{"bar"}}$)
<br> plt.show()

##### Removing missing values
- One option is to remove the rows in the DataFrame that contain missing values. This can be done using the dropna method. However, this may not be ideal if you have a lot of missing data, since that means losing a lot of observations. `.dropna()`

##### Replacing missing values
- Another option is to replace missing values with another value. The fillna method takes a value, and all NaNs are replaced with it. There are also many more sophisticated techniques for replacing missing values. `.fillna(`$\color{red}{\text{0}}$`)`
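
A brief sketch contrasting the two options. The missing entries are again hypothetical, and the names `complete_rows` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing measurements
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "weight_kg": [25.0, np.nan, 22.0],
    "height_cm": [56.0, 43.0, np.nan],
})

# Option 1: drop every row that has at least one missing value
complete_rows = dogs.dropna()

# Option 2: keep all rows and replace each NaN with 0
filled = dogs.fillna(0)

print(complete_rows)
print(filled)
```

Here dropping loses two of the three rows, which shows why removal can be costly when many values are missing.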

##### Creating DataFrame
- Dictionaries
    - my_dict = {$\color{red}{\text{"key1"}}$: value1, 
    <br> &emsp;&emsp; $\color{red}{\text{"key2"}}$: value2, 
    <br> &emsp;&emsp; $\color{red}{\text{"key3"}}$: value3,
    <br> &emsp;&emsp; } 
    - A dictionary is a way of storing data in Python. It holds a set of key-value pairs.
        - There are many ways to create DataFrames from scratch, but we'll discuss two ways: from a list of dictionaries and from a dictionary of lists. In the first method, the DataFrame is built up row by row, while in the second method, the DataFrame is built up column by column.

#### Example: List of dictionaries - by row
list_of_dicts = [{"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
                <br> &emsp;&emsp; "weight_kg": 10, "date_of_birth": "2019-03-14"},
                <br> &emsp;&emsp; {"name": "Scott", "breed": "Dalmatian", "height_cm": 59,
                <br> &emsp;&emsp; "weight_kg": 25, "date_of_birth": "2009-05-09"}
                <br> &emsp;&emsp; ]
<br> new_dogs = pd.DataFrame(list_of_dicts)

#### Example: Dictionary of lists - by column
- **Key** = column name
- **Value** = list of column values
<br>
<br> dict_of_lists = {
        <br>&emsp;&emsp;  "name": ["Ginger", "Scott"],
        <br>&emsp;&emsp;  "breed": ["Dachshund", "Dalmatian"],
        <br>&emsp;&emsp;  "height_cm": [22,59],
        <br>&emsp;&emsp;  "weight_kg": [10,25],
        <br>&emsp;&emsp;  "date_of_birth": ["2019-03-14", "2009-05-09"],
        <br>&emsp;&emsp;  }
<br> new_dogs = pd.DataFrame(dict_of_lists)
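
As a runnable sketch, the two construction routes can be compared directly using the Ginger and Scott rows from the examples above. The names `from_rows` and `from_columns` are illustrative.

```python
import pandas as pd

# Row by row: each dictionary is one dog
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
     "weight_kg": 10, "date_of_birth": "2019-03-14"},
    {"name": "Scott", "breed": "Dalmatian", "height_cm": 59,
     "weight_kg": 25, "date_of_birth": "2009-05-09"},
]
from_rows = pd.DataFrame(list_of_dicts)

# Column by column: each key is a column name, each value a list of entries
dict_of_lists = {
    "name": ["Ginger", "Scott"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10, 25],
    "date_of_birth": ["2019-03-14", "2009-05-09"],
}
from_columns = pd.DataFrame(dict_of_lists)

# Both routes produce the same table
print(from_rows.equals(from_columns))
```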

#### Reading and writing CSVs
- CSV = comma-separated values
- Designed for DataFrame-like data
- Most database and spreadsheet programs can use them or create them
- df = pd.read_csv($\color{red}{\text{"file_name.csv"}}$)
- df.to_csv($\color{red}{\text{"file_name.csv"}}$)
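
A hedged sketch of a CSV round trip. To stay self-contained it writes to an in-memory `io.StringIO` buffer rather than a file on disk. With a real file you would pass a path such as the `"file_name.csv"` placeholder above.

```python
import io
import pandas as pd

new_dogs = pd.DataFrame({
    "name": ["Ginger", "Scott"],
    "breed": ["Dachshund", "Dalmatian"],
    "weight_kg": [10, 25],
})

# Write the DataFrame as CSV text; index=False leaves out the row labels
buffer = io.StringIO()
new_dogs.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print(buffer.getvalue())
print(roundtrip)
```

Without `index=False`, `to_csv` also writes the index as an extra first column, which then shows up as an unnamed column when the file is read back.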