<a href="https://colab.research.google.com/github/Karmabir-Brahma/ML/blob/master/Pandas_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd

## Creating a DataFrame

The following code cell creates a simple DataFrame containing 10 cells organized as follows:

  * 5 rows
  * 2 columns, one named `temperature` and the other named `activity`

The following code cell instantiates the `pd.DataFrame` class to generate a DataFrame. The constructor call passes two arguments:

  * The first argument provides the data to populate the 10 cells. The code cell calls `np.array` to generate the 5x2 NumPy array.
  * The second argument identifies the names of the two columns.

In [11]:
# Create and populate a 5x2 NumPy array.
my_data = np.array([[1, 6], [23, 4], [10, 3], [29, 5], [37, 9]])
print(my_data)

# Create a Python list that holds the names of the two columns.
my_columns = ['temperature', 'activity']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_columns)

# Print the entire DataFrame.
print(my_dataframe)

[[ 1  6]
 [23  4]
 [10  3]
 [29  5]
 [37  9]]
   temperature  activity
0            1         6
1           23         4
2           10         3
3           29         5
4           37         9
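
As a point of comparison, the same table can also be built from a Python dictionary that maps each column name to its list of values. This is a minimal sketch, not part of the original exercise; the names `dict_dataframe` and `array_dataframe` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same values as the NumPy-based example, keyed by column name.
dict_dataframe = pd.DataFrame({
    'temperature': [1, 23, 10, 29, 37],
    'activity': [6, 4, 3, 5, 9],
})

# Rebuild the array-based DataFrame for comparison.
array_dataframe = pd.DataFrame(
    data=np.array([[1, 6], [23, 4], [10, 3], [29, 5], [37, 9]]),
    columns=['temperature', 'activity'],
)

print(dict_dataframe.shape)
# Compare cell values only, since integer dtypes can differ by platform.
print((dict_dataframe.to_numpy() == array_dataframe.to_numpy()).all())
```

The dictionary form keeps each column's values next to its name, which can be easier to read when the columns hold different kinds of data.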


## Adding a new column to a DataFrame

You may add a new column to an existing pandas DataFrame just by assigning values to a new column name. For example, the following code creates a third column named `adjusted` in `my_dataframe`: 

In [None]:
my_dataframe['adjusted'] = my_dataframe['activity']+2
print(my_dataframe)

   temperature  activity  adjusted
0            1         6         8
1           23         4         6
2           10         3         5
3           29         5         7
4           37         9        11
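
Assigning to a new column name changes the DataFrame in place. If the original should stay untouched, `DataFrame.assign` is one alternative: it returns a new DataFrame with the extra column. The sketch below uses a hypothetical `base` DataFrame with the same values as the example above.

```python
import pandas as pd

base = pd.DataFrame({'temperature': [1, 23, 10, 29, 37],
                     'activity': [6, 4, 3, 5, 9]})

# assign() returns a new DataFrame; `base` keeps its two columns.
with_adjusted = base.assign(adjusted=base['activity'] + 2)

print(list(base.columns))           # ['temperature', 'activity']
print(list(with_adjusted.columns))  # ['temperature', 'activity', 'adjusted']
```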


## Specifying a subset of a DataFrame

pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a DataFrame.

In [None]:
print("Rows #0, #1 and #2:")
print(my_dataframe.head(3),'\n')

print("Row #2:")
print(my_dataframe.iloc[[2]],'\n')

print("Rows #1, #2 and #3:")
print(my_dataframe[1:4],'\n')

print("Column 'temperature':")
print(my_dataframe['temperature'])

Rows #0, #1 and #2:
   temperature  activity  adjusted
0            1         6         8
1           23         4         6
2           10         3         5 

Row #2:
   temperature  activity  adjusted
2           10         3         5 

Rows #1, #2 and #3:
   temperature  activity  adjusted
1           23         4         6
2           10         3         5
3           29         5         7 

Column 'temperature':
0     1
1    23
2    10
3    29
4    37
Name: temperature, dtype: int64
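
Beyond `head`, slicing, and column lookup, the indexers `.at`, `.loc`, and `.iloc` and boolean masks cover most other selection needs. The following is a small sketch on a hypothetical `frame` with the same values as above; note that `.loc` slices include the end label while `.iloc` slices exclude the end position.

```python
import pandas as pd

frame = pd.DataFrame({'temperature': [1, 23, 10, 29, 37],
                      'activity': [6, 4, 3, 5, 9]})

# Single cell by row label and column name.
cell = frame.at[2, 'temperature']          # 10

# Rows by label (end-inclusive) and a list of columns.
block = frame.loc[1:3, ['activity']]       # rows 1, 2, 3

# Rows by integer position (end-exclusive).
first_two = frame.iloc[0:2]                # rows 0, 1

# Rows that satisfy a condition.
warm = frame[frame['temperature'] > 20]    # rows 1, 3, 4

print(cell)
print(block)
print(first_two)
print(warm)
```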


## Task 1: Create a DataFrame

Do the following:

  1. Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named `Eleanor`, `Chidi`, `Tahani`, and `Jason`. Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.

  2. Output the following:

     * the entire DataFrame
     * the value in the cell of row #1 of the `Eleanor` column

  3. Create a fifth column named `Janet`, which is populated with the row-by-row sums of `Tahani` and `Jason`.

To complete this task, it helps to know the NumPy basics covered in the NumPy UltraQuick Tutorial. 


In [None]:
datas = np.random.randint(low=0, high=101, size=(3,4))
column_name = ['Eleanor','Chidi','Tahani','Jason']

my_DataFrame = pd.DataFrame(data=datas,columns=column_name)
print(my_DataFrame,'\n')

print("The value in the cell of row #1 of the Eleanor column: %d\n" % my_DataFrame['Eleanor'][1])

my_DataFrame['Janet'] = my_DataFrame['Tahani'] + my_DataFrame['Jason']
print(my_DataFrame)

   Eleanor  Chidi  Tahani  Jason
0       53     21      95     70
1       68     28      14     66
2       48     66      10     76 

The value in the cell of row #1 of the Eleanor column: 68

   Eleanor  Chidi  Tahani  Jason  Janet
0       53     21      95     70    165
1       68     28      14     66     80
2       48     66      10     76     86


In [None]:
#@title Double-click for a solution to Task 1.

# Create a Python list that holds the names of the four columns.
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# Create a 3x4 numpy array, each cell populated with a random integer.
my_data = np.random.randint(low=0, high=101, size=(3, 4))

# Create a DataFrame.
df = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(df)

# Print the value in row #1 of the Eleanor column.
print("\nSecond row of the Eleanor column: %d\n" % df['Eleanor'][1])

# Create a column named Janet whose contents are the sum
# of two other columns.
df['Janet'] = df['Tahani'] + df['Jason']

# Print the enhanced DataFrame
print(df)
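
Because the values are random, every run of the cells above prints different numbers. One possible variation, sketched here under the assumption that repeatable output is wanted, uses NumPy's seeded `Generator` and reads the single cell with `.at` rather than chained `[]` indexing. The seed `42` and the names `rng`, `values`, `seeded_df`, and `eleanor_row_1` are arbitrary choices, not part of the task.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the "random" values repeatable; 42 is arbitrary.
rng = np.random.default_rng(seed=42)

# endpoint=True makes the upper bound (100) inclusive.
values = rng.integers(low=0, high=100, size=(3, 4), endpoint=True)
seeded_df = pd.DataFrame(data=values,
                         columns=['Eleanor', 'Chidi', 'Tahani', 'Jason'])

# Read one cell with a single .at lookup instead of chained [] indexing.
eleanor_row_1 = seeded_df.at[1, 'Eleanor']
print("Row #1 of the Eleanor column: %d" % eleanor_row_1)

# Row-by-row sum of two columns, as in step 3 of the task.
seeded_df['Janet'] = seeded_df['Tahani'] + seeded_df['Jason']
print(seeded_df)
```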

## Copying a DataFrame (optional)

Pandas provides two different ways to duplicate a DataFrame:

* **Referencing.** If you assign a DataFrame to a new variable, both names refer to the same object, so any change made through one name is visible through the other.
* **Copying.** If you call the `pd.DataFrame.copy` method, you create a true independent copy. Changes to the original DataFrame or to the copy are not reflected in the other.

The difference is subtle, but important.

In [None]:
# Using a reference
print("Experiment with a reference:")
ref_dataframe = my_DataFrame

print("Existing value of my_DataFrame: %d" % my_DataFrame['Chidi'][1])
print("Existing value of ref_dataframe: %d \n" % ref_dataframe['Chidi'][1])

# Modify a cell in my_DataFrame
my_DataFrame.at[1, 'Chidi'] = my_DataFrame['Chidi'][1] + 2

print("Updated my_DataFrame: %d" % my_DataFrame['Chidi'][1])
print("Updated ref_dataframe: %d" % ref_dataframe['Chidi'][1])

Experiment with a reference:
Existing value of my_DataFrame: 38
Existing value of ref_dataframe: 38 

Updated my_DataFrame: 40
Updated ref_dataframe: 40


In [12]:
#Using copy
print("Experiment with a copy:")
copy_DataFrame = my_dataframe.copy()

print("Existing value of temp row 1 in real: %d" %my_dataframe['temperature'][1])
print("Existing value of temp row 1 in copied: %d\n" %copy_DataFrame['temperature'][1])

#Modify a cell in my_dataframe
print("Modify a cell in my_dataframe")
my_dataframe.at[1,'temperature'] = my_dataframe['temperature'][1]+2

print("Updated value of temp row 1 in real: %d" %my_dataframe['temperature'][1])
print("Updated value of temp row 1 in copied: %d \n" %copy_DataFrame['temperature'][1])

print("Existing value of activity row 0 in real: %d" %my_dataframe['activity'][0])
print("Existing value of activity row 0 in copied: %d\n" %copy_DataFrame['activity'][0])

# Modify a cell in copy_DataFrame
print("Modify a cell in copy_dataframe")
copy_DataFrame.at[0, 'activity'] = copy_DataFrame['activity'][0]+4

print("Updated value of activity row 0 in real: %d" %my_dataframe['activity'][0])
print("Updated value of activity row 0 in copied: %d" %copy_DataFrame['activity'][0])


Experiment with a copy:
Existing value of temp row 1 in real: 23
Existing value of temp row 1 in copied: 23

Modify a cell in my_dataframe
Updated value of temp row 1 in real: 25
Updated value of temp row 1 in copied: 23 

Existing value of activity row 0 in real: 6
Existing value of activity row 0 in copied: 6

Modify a cell in copy_dataframe
Updated value of activity row 0 in real: 6
Updated value of activity row 0 in copied: 10
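
The two experiments can also be checked directly with Python's `is` operator and `DataFrame.equals`. The sketch below is an illustration on a small hypothetical DataFrame named `original`, not a rerun of the cells above.

```python
import pandas as pd

original = pd.DataFrame({'temperature': [1, 23], 'activity': [6, 4]})

reference = original          # second name for the same object
duplicate = original.copy()   # separate object with its own data

print(reference is original)        # True
print(duplicate is original)        # False
print(duplicate.equals(original))   # True: same contents for now

# Change one cell through the original name.
original.at[1, 'temperature'] = 25

print(reference.at[1, 'temperature'])  # 25
print(duplicate.at[1, 'temperature'])  # 23
print(duplicate.equals(original))      # False
```

Here `is` tests whether two names point to the same object, while `equals` compares contents, so a fresh copy is equal to the original but not identical to it.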
