# Pandas DataFrame UltraQuick Tutorial

Machine Learning prework from Google ([link](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=mlcc-prework&hl=en)).


[**DataFrames**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) are the central data structure in the pandas API. A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet:

-   A DataFrame stores data in cells.
-   A DataFrame has named columns (usually) and numbered rows.


In [1]:
import numpy as np
import pandas as pd

## Creating a DataFrame

The following code cell creates a simple DataFrame containing 10 cells organized as follows:

-   5 rows
-   2 columns, one named `Temperature` and the other named `Activity`

The following code cell instantiates the `pd.DataFrame` class to generate a DataFrame. The call passes two arguments:

-   The first argument provides the data to populate the 10 cells. The code cell calls `np.array` to generate the 5x2 NumPy array.
-   The second argument identifies the names of the two columns.


In [2]:
# Create and populate a 5x2 NumPy array:
my_data = np.array([
    [0, 3],
    [10, 7],
    [20, 9],
    [30, 14],
    [40, 15],
])

# Create a Python list that holds the names of the two columns:
my_column_names = ['Temperature', 'Activity']

# Create a DataFrame:
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

print(my_dataframe)

   Temperature  Activity
0            0         3
1           10         7
2           20         9
3           30        14
4           40        15
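As an aside, a NumPy array is not the only possible data source. The sketch below builds the same 5x2 DataFrame from a Python dictionary, where each key becomes a column name. The variable name `dict_dataframe` is made up for this illustration and does not appear in the original notebook.

```python
import pandas as pd

# Each dictionary key becomes a column name;
# each list holds that column's values, one per row.
dict_dataframe = pd.DataFrame({
    'Temperature': [0, 10, 20, 30, 40],
    'Activity': [3, 7, 9, 14, 15],
})

print(dict_dataframe)
print(dict_dataframe.shape)  # (rows, columns)
```

This form can be easier to read when the columns hold different kinds of data, since each column is written out on its own line.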


## Adding a new column to a DataFrame

You may add a new column to an existing pandas DataFrame just by assigning values to a new column name. For example, the following code creates a third column named `Adjusted` in `my_dataframe`:


In [6]:
my_dataframe["Adjusted"] = my_dataframe["Activity"] + 2
print(my_dataframe)


   Temperature  Activity  Adjusted
0            0         3         5
1           10         7         9
2           20         9        11
3           30        14        16
4           40        15        17
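Assignment by column name changes `my_dataframe` in place. If you would rather leave the original untouched, one possible alternative is the `DataFrame.assign` method, which returns a new DataFrame with the extra column. This is a minimal sketch; the name `with_adjusted` is hypothetical and the DataFrame is rebuilt here so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the two-column DataFrame from the earlier cell.
my_dataframe = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['Temperature', 'Activity'],
)

# assign() returns a new DataFrame; my_dataframe itself is not modified.
with_adjusted = my_dataframe.assign(Adjusted=my_dataframe['Activity'] + 2)

print(with_adjusted.columns.tolist())  # includes 'Adjusted'
print(my_dataframe.columns.tolist())   # still only the two original columns
```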


## Specifying a subset of a DataFrame

Pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a DataFrame.

In [12]:
print('Rows #0, #1, and #2:')
print(my_dataframe.head(3), '\n')

print('Row #2:')
print(my_dataframe.iloc[[2]], '\n')

print('Rows #1, #2, and #3:')
print(my_dataframe[1:4], '\n')

print("Column 'Temperature'")
print(my_dataframe['Temperature'])

Rows #0, #1, and #2:
   Temperature  Activity  Adjusted
0            0         3         5
1           10         7         9
2           20         9        11 

Row #2:
   Temperature  Activity  Adjusted
2           20         9        11 

Rows #1, #2, and #3:
   Temperature  Activity  Adjusted
1           10         7         9
2           20         9        11
3           30        14        16 

Column 'Temperature'
0     0
1    10
2    20
3    30
4    40
Name: Temperature, dtype: int32
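The cell above covers rows, slices, and a single column, but not an individual cell. The following sketch shows a few more selection forms using the `.at`, `.iloc`, and `.loc` indexers. The variable names (`cell`, `cell_by_position`, `rows_1_to_3`) are made up for this illustration, and the DataFrame is rebuilt without the `Adjusted` column so the example is self-contained.

```python
import numpy as np
import pandas as pd

my_dataframe = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['Temperature', 'Activity'],
)

# One cell, addressed by row label and column name:
cell = my_dataframe.at[2, 'Activity']

# The same cell, addressed by integer row and column positions:
cell_by_position = my_dataframe.iloc[2, 1]

# Rows by label with .loc. Unlike my_dataframe[1:4], a .loc slice
# includes its end label, so 1:3 selects rows #1, #2, and #3.
rows_1_to_3 = my_dataframe.loc[1:3, ['Temperature']]

print(cell)
print(cell_by_position)
print(rows_1_to_3)
```

Note the difference in slice endpoints: plain `[1:4]` slicing excludes the stop position, while `.loc[1:3]` includes the stop label.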


## Task 1: Create a DataFrame

Do the following:

  1. Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named `Eleanor`,  `Chidi`, `Tahani`, and `Jason`.  Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.

  2. Output the following:

     * the entire DataFrame
     * the value in the cell of row #1 of the `Eleanor` column

  3. Create a fifth column named `Janet`, which is populated with the row-by-row sums of `Tahani` and `Jason`.

To complete this task, it helps to know the NumPy basics covered in the NumPy UltraQuick Tutorial. 


In [31]:
# base = np.random.randint(low=0, high= 101, size=[3,4])
dataset = np.random.randint(0, 101, [3,4])
print(dataset)

column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

df = pd.DataFrame(data=dataset, columns=column_names)
print(df)

df['Janet'] = df['Tahani'] + df['Jason']
print(df)


[[75 58 95 19]
 [88 40 59 16]
 [48  5  0 51]]
   Eleanor  Chidi  Tahani  Jason
0       75     58      95     19
1       88     40      59     16
2       48      5       0     51
   Eleanor  Chidi  Tahani  Jason  Janet
0       75     58      95     19    114
1       88     40      59     16     75
2       48      5       0     51     51
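The solution cell above does not print the value in row #1 of the `Eleanor` column, which step 2 asks for. One way to complete that step is sketched below. It uses NumPy's `default_rng` generator with a fixed seed so that reruns give the same numbers; the seed value and the names `rng`, `task_df`, and `eleanor_row_1` are assumptions for this illustration, not part of the original solution.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random values are repeatable (seed chosen arbitrarily).
rng = np.random.default_rng(seed=0)

column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# integers() excludes the upper bound by default, so 101 yields 0..100 inclusive.
task_df = pd.DataFrame(
    data=rng.integers(0, 101, size=(3, 4)),
    columns=column_names,
)
print(task_df)

# Step 2: the value in row #1 of the Eleanor column.
eleanor_row_1 = task_df.at[1, 'Eleanor']
print('Eleanor, row #1:', eleanor_row_1)

# Step 3: Janet holds the row-by-row sums of Tahani and Jason.
task_df['Janet'] = task_df['Tahani'] + task_df['Jason']
print(task_df)
```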


## Copying a DataFrame (optional)

There are two different ways to get a second variable that holds a DataFrame:

* **Referencing.** If you assign a DataFrame to a new variable, both variables refer to the same object, so any change made through one is visible through the other.
* **Copying.** If you call the `pd.DataFrame.copy` method, you create a true independent copy.  Changes to the original DataFrame or to the copy will not be reflected in the other.

The difference is subtle, but important.

In [40]:
# Create a reference by assigning dataframe to a new variable:
reference_to_df = df
print('Original dataframe:\n', df)
print('Starting value of df: %d' % df['Jason'][1])
print('Starting value of reference_to_df: %d\n' % reference_to_df['Jason'][1])

# Make some modifications:
df.at[1, 'Jason'] = df['Jason'][1] + 5
print('Updated value of df: %d' % df['Jason'][1])
print('Updated value of reference_to_df: %d\n' % reference_to_df['Jason'][1])


# Create a true copy of dataframe:
copy_of_df = df.copy()
print('Original dataframe:\n', df)
print('Starting value of df: %d' % df['Jason'][1])
print('Starting value of copy_of_df: %d\n' % copy_of_df['Jason'][1])

# Modify the original only; the copy keeps its old value:
df.at[1, 'Jason'] = df['Jason'][1] + 5
print('Updated value of df: %d' % df['Jason'][1])
print('Updated value of copy_of_df: %d\n' % copy_of_df['Jason'][1])


Original dataframe:
    Eleanor  Chidi  Tahani  Jason  Janet
0       75     58      95     19    114
1       88     40      59     21     75
2       48      5       0     51     51
Starting value of df: 21
Starting value of reference_to_df: 21

Updated value of df: 26
Updated value of reference_to_df: 26

Original dataframe:
    Eleanor  Chidi  Tahani  Jason  Janet
0       75     58      95     19    114
1       88     40      59     26     75
2       48      5       0     51     51
Starting value of df: 26
Starting value of copy_of_df: 26

Updated value of df: 31
Updated value of copy_of_df: 26
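A quick way to tell the two cases apart is Python's `is` operator, which checks whether two variables name the same object. The sketch below uses a small stand-in DataFrame; the names `original`, `reference`, and `independent` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

original = pd.DataFrame({'Jason': [19, 16, 51]})

reference = original            # a second name for the same object
independent = original.copy()   # a new object with its own data

print(reference is original)    # True
print(independent is original)  # False

# Change the original; only the reference sees the new value.
original.at[1, 'Jason'] = 99
print(reference.at[1, 'Jason'])    # 99
print(independent.at[1, 'Jason'])  # 16
```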

