# <span style="color:maroon">**Manipulating DataFrames with Pandas**</span>

## <span style="color:blue">**Functions and Methods**</span>

#### DataFrame slicing
`df["column_label"]["row_label"]`  
`df.loc["row_label", "column_label"]`  
`df.iloc[row_index, column_index]`
  
`df["column_label"]` **--> pandas Series**  
`df[["column_label"]]` **--> pandas DataFrame**  
`df[["column_label_1", "column_label_2", ...]]` **--> pandas DataFrame with only the column labels provided**  

`df["row_start":"row_end"]` **--> pandas DataFrame for rows between start and end inclusive**  
`df["row_end":"row_start":-1]` **--> pandas DataFrame for rows between start and end in reverse order inclusive**  
`df.loc[:, :"column_rightmost"]` **--> pandas DataFrame for all rows and all columns from the leftmost through column_rightmost inclusive**  
`df.loc[:, "column_start":"column_end"]` **--> pandas DataFrame for all rows and all columns between start and end inclusive**  
`df.loc[row_list, column_list]` **--> pandas DataFrame for labels in row_list and column_list (this is a highly configured slice)**  
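
The slicing forms above can be tried on a small frame. This is a minimal sketch; the data and the labels (`eggs`, `salt`, `spam`, `Jan`, ...) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the labels are illustrative only.
df = pd.DataFrame(
    {"eggs": [47, 110, 221], "salt": [12.0, 50.0, 89.0], "spam": [17, 31, 72]},
    index=["Jan", "Feb", "Mar"],
)

by_label = df.loc["Feb", "salt"]        # label-based lookup
by_position = df.iloc[1, 1]             # position-based lookup of the same cell

as_series = df["eggs"]                  # single brackets -> Series
as_frame = df[["eggs"]]                 # double brackets -> one-column DataFrame

rows_inclusive = df.loc["Jan":"Feb"]    # label slices include the end label
rows_exclusive = df.iloc[0:1]           # positional slices exclude the end position
cols_up_to_salt = df.loc[:, :"salt"]    # all rows, columns up to and including "salt"
picked = df.loc[["Jan", "Mar"], ["eggs", "spam"]]   # explicit row and column lists
```

Note the asymmetry: label slices are inclusive on both ends, while positional (`iloc`) slices follow the usual Python half-open convention.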

#### When a boolean series is used to slice a dataframe, it is called a filter
`boolean_series = df["column_label"] > value`  
`df_filtered = df[boolean_series]`  
`df.loc[boolean_series, "column_label_1"] = assign_new_value` **--> assign a new value to a column using a row-based boolean filter (a single `.loc` call avoids chained assignment, which may write to a temporary copy)**  
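
A small sketch of filtering and filtered assignment, using made-up data. The column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"eggs": [47, 110, 221], "salt": [12.0, 50.0, 89.0]})

big_eggs = df["eggs"] > 100             # boolean Series, one entry per row
df_filtered = df[big_eggs]              # keeps only the rows where the mask is True

# Combine masks with & (and), | (or), ~ (not); each condition needs parentheses.
both = df[(df["eggs"] > 100) & (df["salt"] < 60)]

# Assign through .loc so the write lands on df itself.
df.loc[big_eggs, "salt"] = 0.0
```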

#### Dropping data
`df.dropna(how="any")` **--> drops rows in df where any column value is NaN**  
`df.dropna(how="all")` **--> drops rows in df where all column values are NaN**  
`df.dropna(thresh=threshold_value, axis="columns")` **--> drops columns with fewer than threshold_value non-missing values**  
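
The three `dropna` variants behave as follows on a small made-up frame (a sketch; the data is invented so that each call removes something different).

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in different patterns.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [np.nan, np.nan, 6.0, np.nan],
        "c": [7.0, np.nan, 9.0, 10.0],
    }
)

no_missing = df.dropna(how="any")        # only row 2 has no NaN at all
not_all_missing = df.dropna(how="all")   # only row 1 is entirely NaN
# Keep columns that have at least 2 non-missing values ("b" has only 1).
dense_cols = df.dropna(thresh=2, axis="columns")
```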

#### Transforming data
`df.apply(func)` **--> applies func to each column of df (pass `axis="columns"` to apply it to each row instead)**  
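
To see what `func` actually receives, here is a minimal sketch with invented data: by default `apply` hands each column to the function as a Series, and with `axis=1` it hands over each row.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Default axis: func is called once per column and receives a Series.
col_ranges = df.apply(lambda col: col.max() - col.min())

# axis=1: func is called once per row.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# For element-by-element work on a single column, Series.map is one option.
doubled = df["a"].map(lambda x: x * 2)
```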

##### <span style="color:red">**Example of using .map()**</span>
`red_vs_blue = {"Obama": "blue", "Romney": "red"}` **--> dictionary with keys corresponding to the categorical values that you want to map**  
`election["color"] = election["winner"].map(red_vs_blue)` **--> use the dictionary to map the "winner" column to the new column "color"**  
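
A runnable version of the mapping above, on a made-up `election` frame (the county names and the third candidate are invented for illustration). It also shows what happens to values that have no key in the dictionary.

```python
import pandas as pd

# Hypothetical stand-in for the election data.
election = pd.DataFrame(
    {"county": ["Adams", "Berks", "Bucks"], "winner": ["Romney", "Obama", "Stein"]}
)

red_vs_blue = {"Obama": "blue", "Romney": "red"}
election["color"] = election["winner"].map(red_vs_blue)

# "Stein" is not a key in the dictionary, so that row gets NaN.
unmapped = election["color"].isna()
```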

##### <span style="color:red">**Vectorizing over looping**</span>
`df.floordiv(12)` **--> this is a pandas method that utilizes vectorization**  

`numpy.floor_divide(df, 12)` **--> this is a numpy ufunc (universal function) that also utilizes vectorization**  

>`def dozens(n):  
      return n // 12`
>
>`df.apply(dozens)`  

`df.apply(lambda n: n // 12)`
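
The variants above should all give the same result; the vectorized forms simply avoid calling a Python function for each column. A sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"eggs": [47, 110, 221], "spam": [17, 31, 72]})

via_method = df.floordiv(12)                   # pandas method
via_operator = df // 12                        # same operation, operator form
via_ufunc = np.floor_divide(df, 12)            # NumPy universal function
via_apply = df.apply(lambda col: col // 12)    # works, but goes through a Python call per column
```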

##### Vectorized string methods work on an Index (and on pandas Series) as well
`df.index = df.index.str.upper()`  

`df.index = df.index.map(str.upper)`  **--> for a DataFrame index, .map applies a function to the elements in an index**  
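
Both spellings can be compared directly. A minimal sketch with an invented frame; the `.str` accessor is available on Series as well as on an index.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"sales": [10, 20]}, index=["east", "west"])

upper_via_str = df.index.str.upper()        # vectorized string accessor
upper_via_map = df.index.map(str.upper)     # same labels, one function call per label
df.index = upper_via_str

# The same accessor on a Series.
names = pd.Series(["ann", "bob"]).str.title()
```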

##### Example of using a function in a vectorized manner
>`from scipy.stats import zscore  
 turnout_zscore = zscore(election["turnout"])  
 election["turnout_zscore"] = turnout_zscore`  
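
For reference, the z-score can also be written with plain pandas arithmetic. This sketch uses invented turnout values and assumes the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`; note that pandas' own `.std()` defaults to `ddof=1`.

```python
import pandas as pd

# Hypothetical turnout percentages.
turnout = pd.Series([60.0, 70.0, 80.0])

# Subtract the mean and divide by the population standard deviation,
# both computed once and broadcast over the whole Series.
turnout_zscore = (turnout - turnout.mean()) / turnout.std(ddof=0)
```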

## <span style="color:blue">**Advanced Indexing**</span>

#### Key building blocks of pandas data structures
1. indexes: sequences of labels, immutable (to modify an index you need to replace the whole index)
2. Series: 1D array with an associated index
3. DataFrames: 2D array with Series as columns
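
The immutability of an index can be checked directly. A small sketch with made-up data: assigning to a single index element raises a `TypeError`, while replacing the whole index works.

```python
import pandas as pd

# Hypothetical Series: a 1D array of values plus an index of labels.
prices = pd.Series([10.70, 10.86, 10.74], index=["mon", "tue", "wed"])

try:
    prices.index[0] = "MON"             # element-wise assignment is not allowed
except TypeError as err:
    message = str(err)

# Replacing the index wholesale is fine.
prices.index = [label.upper() for label in prices.index]
```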

##### <span style="color:red">**You should try to create data structures where indexes are unique (although this is not a requirement)**</span>

##### <span style="color:red">**Modifying an entire index**</span>
>`new_idx = [x.upper() for x in df.index]`  
`df.index = new_idx`  
`df.index.name = "index_name_label"`  
`df.columns.name = "columns_name_label"`  

##### <span style="color:red">**Creating an index from scratch**</span>
`index_list` **--> list that you want to use to generate an index**  
`df.index = index_list`  

##### <span style="color:red">**Hierarchical index (multi-index)**</span>
`df.loc[["outer_index_label_1", "outer_index_label_2"]]` **--> a list of outer-index labels retrieves all rows under those labels (use a tuple to address an outer/inner pair)**  
`df["outer_index_row_label_start":"outer_index_row_label_end"]` **--> retrieves relevant rows between start and end inclusive**  
`df = df.set_index(["outer_index_label", "inner_index_label"])` **--> sets the multi-index for df**  
`df = df.sort_index()`

##### Accessing the outermost index works like single index slicing
##### Accessing inner indices requires --> slice <-- This is going to require some practice!!!
`df.loc[("outer_index_label", "inner_index_label")]` **--> observe that a tuple is being passed to the indexer**  
`df.loc[(["outer_index_labels"], "inner_index_label"), :]` **--> the tuple is defining the rows within the multi-index**  
`df.loc[(slice(None), 2), :]` **--> slice(None) selects every label on the outer index level, so only the inner level is filtered**  
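
The multi-index lookups above can be practiced on a small frame. This is a sketch with invented data; the names `stocks`, `symbol`, `day` and `close` are assumptions for illustration. `pd.IndexSlice` is shown as an alternative spelling of `slice(None)`.

```python
import pandas as pd

# Hypothetical data with a two-level (symbol, day) index.
stocks = (
    pd.DataFrame(
        {
            "symbol": ["AAPL", "AAPL", "CSCO", "CSCO", "MSFT", "MSFT"],
            "day": [1, 2, 1, 2, 1, 2],
            "close": [113.0, 113.05, 31.5, 31.35, 57.24, 57.64],
        }
    )
    .set_index(["symbol", "day"])
    .sort_index()                        # slicing a multi-index expects it to be sorted
)

one_row = stocks.loc[("CSCO", 2)]                   # a single (outer, inner) pair
one_outer = stocks.loc["AAPL"]                      # all rows under one outer label
outer_range = stocks.loc["AAPL":"CSCO"]             # outer slice, inclusive
some_outer = stocks.loc[(["AAPL", "MSFT"], 2), :]   # inner label 2 for two outer labels
all_outer = stocks.loc[(slice(None), 2), :]         # inner label 2 for every outer label
same_thing = stocks.loc[pd.IndexSlice[:, 2], :]     # IndexSlice spelling of the line above
```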


## <span style="color:blue">**Rearranging and Reshaping Data**</span>

##### <span style="color:red">**Pivot**</span>

In [None]:
import pandas as pd
users = pd.read_csv("./data/users.csv", usecols=[1,2,3,4])

print("The starting dataframe: df --> users \n")
print(users, "\n")

print("A pivoted dataframe: df --> pivoted_users \n")
pivoted_users = pd.pivot(data=users, index="weekday", columns="city", values="visitors")
print(pivoted_users, "\n")

print("A stratified pivoted dataframe: df --> stratified_users \n")
stratified_users = pd.pivot(data=users, index="weekday", columns="city")
print(stratified_users)
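
The cell above depends on `./data/users.csv`. For a version that runs without the file, here is a sketch that builds a small stand-in frame by hand; the values are invented and only the column names follow the cell above.

```python
import pandas as pd

# Hypothetical stand-in for users.csv.
users = pd.DataFrame(
    {
        "weekday": ["Sun", "Sun", "Mon", "Mon"],
        "city": ["Austin", "Dallas", "Austin", "Dallas"],
        "visitors": [139, 237, 326, 456],
        "signups": [7, 12, 3, 5],
    }
)

# One value column: rows are weekdays, columns are cities.
visitors_by_city = users.pivot(index="weekday", columns="city", values="visitors")

# No values argument: every remaining column is pivoted, giving two column levels.
all_metrics = users.pivot(index="weekday", columns="city")
```

`pivot` requires each (index, columns) pair to be unique; duplicate pairs raise an error, which is where `pivot_table` comes in.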

##### <span style="color:red">**Stack and Unstack**</span>

In [None]:
import pandas as pd
users = pd.read_csv("./data/users.csv", usecols=[1,2,3,4], index_col=["city", "weekday"]).sort_index()

print("The starting dataframe: df --> users \n")
print(users, "\n")

print('users "weekday" level is now unstacked: df --> unstacked_users \n')
unstacked_users = users.unstack("weekday")
print(unstacked_users, "\n")

print('unstacked_users "weekday" level is now stacked again: df --> restacked_users \n')
restacked_users = unstacked_users.stack("weekday")
print(restacked_users, "\n")

print("Are users and restacked_users identical?  ", users.equals(restacked_users))

In [None]:
import pandas as pd
users = pd.read_csv("./data/users.csv", usecols=[1,2,3,4], index_col=["city", "weekday"]).sort_index()

print("The starting dataframe: df --> users \n")
print(users, "\n")

print('users "city" level is now unstacked: df --> unstacked_users \n')
unstacked_users = users.unstack("city")
print(unstacked_users, "\n")

print('unstacked_users "city" level is now stacked again: df --> restacked_users \n')
restacked_users = unstacked_users.stack("city")
print(restacked_users, "\n")

print("Are users and restacked_users identical?  ", users.equals(restacked_users), "\n")

print("swap levels 0 and 1: df --> swapped_levels_users \n")
swapped_levels_users = restacked_users.swaplevel(0, 1)
print(swapped_levels_users, "\n")

print("Are users and swapped_levels_users identical?  ", users.equals(swapped_levels_users), "\n")

print("swapped_levels_users index sorted: df --> swapped_levels_users \n")
swapped_levels_users = swapped_levels_users.sort_index()
print(swapped_levels_users, "\n")

print("Are users and swapped_levels_users identical?  ", users.equals(swapped_levels_users))
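
A file-free sketch of the same moves, on an invented frame that mimics the `users` layout: `unstack` moves an index level to the columns, `stack` moves it back, and `swaplevel` reorders the index levels.

```python
import pandas as pd

# Hypothetical stand-in for users.csv, indexed by (city, weekday).
users = (
    pd.DataFrame(
        {
            "city": ["Austin", "Austin", "Dallas", "Dallas"],
            "weekday": ["Mon", "Sun", "Mon", "Sun"],
            "visitors": [326, 139, 456, 237],
            "signups": [3, 7, 5, 12],
        }
    )
    .set_index(["city", "weekday"])
    .sort_index()
)

by_weekday = users.unstack("weekday")          # weekday moves from the rows to the columns
round_trip = by_weekday.stack("weekday")       # and back into the row index

swapped = users.swaplevel(0, 1).sort_index()   # index order is now (weekday, city)
```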

##### <span style="color:red">**Melt**</span>

In [None]:
import pandas as pd
users = pd.read_csv("./data/users.csv", usecols=[1,2,3]).sort_values(["city", "weekday"]).reset_index(drop=True)

print("The starting dataframe: df --> users \n")
print(users, "\n")

print("A pivoted dataframe: df --> pivoted_users \n")
pivoted_users = users.pivot(index="weekday", columns="city", values="visitors")
print(pivoted_users, "\n")

print("A reset index dataframe: df --> reset_index_users \n")
reset_index_users = pivoted_users.reset_index()
print(reset_index_users, "\n")

print("A melted dataframe: df --> melted_users \n")
melted_users = pd.melt(reset_index_users, id_vars=["weekday"], value_name="visitors")
print(melted_users, "\n")

print("Are users and melted_users identical?  ", users.equals(melted_users))

In [None]:
import pandas as pd
users = pd.read_csv("./data/users.csv", usecols=[1,2,3,4]).sort_values(by=["weekday", "city"], ascending=[False, True]).reset_index(drop=True)

print("The starting dataframe: df --> users \n")
print(users, "\n")

print("A melted dataframe: df --> melted_users \n")
melted_users = pd.melt(users, id_vars=["weekday", "city"])
print(melted_users, "\n")

print("A melted dataframe with nicer column names: df --> melted_users_2 \n")
melted_users_2 = pd.melt(users, id_vars=["weekday", "city"], var_name="metric", value_name="count")
print(melted_users_2, "\n")

In [None]:
import pandas as pd
users = pd.read_csv("./data/users.csv", usecols=[1,2,3,4]).sort_values(by=["weekday", "city"], ascending=[False, True]).reset_index(drop=True)

print("The starting dataframe: df --> users \n")
print(users, "\n")

print("An indexed dataframe: df --> indexed_users \n")
indexed_users = users.set_index(["city", "weekday"])
print(indexed_users, "\n")

print("A dataframe with data values shown in key-value pairs: df --> kv_users \n")
kv_users = pd.melt(indexed_users, col_level=0, var_name="metric", value_name="count")
print(kv_users, "\n")
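
Melt is roughly the inverse of pivot. A minimal sketch with invented numbers: a wide table with one column per city is melted into long form, then pivoted back.

```python
import pandas as pd

# Hypothetical wide table: one row per weekday, one column per city.
wide = pd.DataFrame(
    {"weekday": ["Mon", "Sun"], "Austin": [326, 139], "Dallas": [456, 237]}
)

# id_vars stay as they are; every other column becomes a (city, visitors) pair.
long = pd.melt(wide, id_vars=["weekday"], var_name="city", value_name="visitors")

# Pivoting the long form recovers the wide layout.
back = long.pivot(index="weekday", columns="city", values="visitors")
```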

##### <span style="color:red">**Pivot Table**</span>

In [None]:
import pandas as pd
users = pd.read_csv("./data/users.csv", usecols=[1,2,3,4]).sort_values(by=["weekday", "city"], ascending=[False, True]).reset_index(drop=True)

print("The starting dataframe: df --> users \n")
print(users, "\n")

print("A stratified pivoted dataframe: df --> stratified_users \n")
stratified_users = users.pivot_table(index="weekday", columns="city")
print(stratified_users, "\n")

print("A use-case for aggfunc: df --> users_counted_1 \n")
users_counted_1 = users.pivot_table(index="weekday", aggfunc="count")
print(users_counted_1, "\n")

print("An equivalent implementation using aggfunc=len (unlike count, len also counts missing values): df --> users_counted_2 \n")
users_counted_2 = users.pivot_table(index="weekday", aggfunc=len)
print(users_counted_2, "\n")

In [None]:
import pandas as pd
users = pd.read_csv("./data/users.csv", usecols=[1,2,3,4]).sort_values(by=["weekday", "city"], ascending=[False, True]).reset_index(drop=True)

print("The starting dataframe: df --> users \n")
print(users, "\n")

print("Generating sums by using aggfunc: df --> sum_by_weekday_users \n")
sum_by_weekday_users = users.pivot_table(index="weekday", aggfunc="sum")
print(sum_by_weekday_users, "\n")

print("Generating sums with grand-totals using aggfunc: df --> sum_with_grand_totals_users \n")
sum_with_grand_totals_users = users.pivot_table(index="weekday", aggfunc="sum", margins=True)
print(sum_with_grand_totals_users)
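
What separates `pivot_table` from `pivot` is that it aggregates duplicate (index, columns) pairs. A sketch on invented data in which the ("Sun", "Austin") pair appears twice:

```python
import pandas as pd

# Hypothetical data with a repeated (weekday, city) pair, which plain pivot would reject.
users = pd.DataFrame(
    {
        "weekday": ["Sun", "Sun", "Sun", "Mon", "Mon"],
        "city": ["Austin", "Austin", "Dallas", "Austin", "Dallas"],
        "visitors": [100, 39, 237, 326, 456],
    }
)

# The default aggfunc is the mean of the duplicated entries.
mean_visitors = users.pivot_table(index="weekday", columns="city", values="visitors")

# aggfunc="sum" adds them up instead.
total_visitors = users.pivot_table(
    index="weekday", columns="city", values="visitors", aggfunc="sum"
)

# margins=True appends an "All" row and column holding the totals.
with_totals = users.pivot_table(
    index="weekday", columns="city", values="visitors", aggfunc="sum", margins=True
)
```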

## <span style="color:blue">**Grouping Data**</span>

##### <span style="color:red">**Group By**</span>

In [17]:
import pandas as pd
titanic = pd.read_csv("./data/titanic.csv")

print("The starting dataframe: df --> titanic \n")
print(titanic, "\n")

print("Group by pclass (note that this is a DataFrameGroupBy object): ob --> by_class \n")
by_class = titanic.groupby("pclass")
print(by_class, "\n")

print("Calculate count by group pclass: sr --> count_by_pclass \n")
count_by_pclass = by_class["survived"].count()
print(count_by_pclass, "\n")

print("Group by embarked and pclass (note that this is also a DataFrameGroupBy object): ob --> by_embarked_pclass \n")
by_embarked_pclass = titanic.groupby(["embarked", "pclass"])
print(by_embarked_pclass, "\n")

print("Calculate count by multi-group embarked and pclass: sr --> count_by_embarked_pclass \n")
count_by_embarked_pclass = by_embarked_pclass["survived"].count()
print(count_by_embarked_pclass, "\n")

The starting dataframe: df --> titanic 

      pclass  survived                                             name  \
0          1         1                    Allen, Miss. Elisabeth Walton   
1          1         1                   Allison, Master. Hudson Trevor   
2          1         0                     Allison, Miss. Helen Loraine   
3          1         0             Allison, Mr. Hudson Joshua Creighton   
4          1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
...      ...       ...                                              ...   
1304       3         0                             Zabour, Miss. Hileni   
1305       3         0                            Zabour, Miss. Thamine   
1306       3         0                        Zakarian, Mr. Mapriededer   
1307       3         0                              Zakarian, Mr. Ortin   
1308       3         0                               Zimmerman, Mr. Leo   

         sex    age  sibsp  parch  ticket      fare    cab
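
The group-by steps can also be tried without the CSV. The rows below are invented and are not taken from the Titanic data; only the column names follow the cell above.

```python
import pandas as pd

# Hypothetical rows, made up for illustration.
titanic_sample = pd.DataFrame(
    {
        "pclass": [1, 1, 2, 3, 3, 3],
        "embarked": ["S", "C", "S", "S", "Q", "S"],
        "survived": [1, 0, 1, 0, 0, 1],
        "fare": [211.3, 151.6, 13.0, 7.9, 7.8, 8.1],
    }
)

by_class = titanic_sample.groupby("pclass")            # lazy GroupBy object, nothing computed yet

count_by_pclass = by_class["survived"].count()         # Series indexed by pclass
survival_rate = by_class["survived"].mean()            # mean of a 0/1 column is a rate

# Grouping by two keys gives a result with a two-level index.
count_by_embarked_pclass = titanic_sample.groupby(["embarked", "pclass"])["survived"].count()

# Several aggregations at once give one column per function.
fare_range = by_class["fare"].agg(["min", "max"])
```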

In [55]:
import pandas as pd
life = pd.read_csv("./data/life_expectancy_at_birth.csv", usecols=list(range(10)), index_col=0)

print("The starting dataframe: df --> life \n")
print(life, "\n")


The starting dataframe: df --> life 

           Life expectancy   1800   1801   1802   1803   1804   1805   1806  \
0                 Abkhazia    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
1              Afghanistan  28.21  28.20  28.19  28.18  28.17  28.16  28.15   
2    Akrotiri and Dhekelia    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3                  Albania  35.40  35.40  35.40  35.40  35.40  35.40  35.40   
4                  Algeria  28.82  28.82  28.82  28.82  28.82  28.82  28.82   
..                     ...    ...    ...    ...    ...    ...    ...    ...   
255             Yugoslavia    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
256                 Zambia  32.60  32.60  32.60  32.60  32.60  32.60  32.60   
257               Zimbabwe  33.70  33.70  33.70  33.70  33.70  33.70  33.70   
258                  Åland    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
259            South Sudan  26.67  26.67  26.67  26.67  26.67  26.67  26.67   

      1807  


## <span style="color:blue">**Bringing It All Together**</span>