
A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet:

  * A DataFrame stores data in cells. 
  * A DataFrame has named columns (usually) and numbered rows.

In [0]:
import numpy as np
import pandas as pd

## Creating a DataFrame

The following code cell creates a simple DataFrame containing 6 cells organized as follows:

  * 3 rows
  * 2 columns, one named `age` and the other named `tooth`

The code cell instantiates the `pd.DataFrame` class to generate a DataFrame. The constructor call passes two arguments:

  * The `data` argument provides the values to populate the 6 cells. The code cell calls `np.array` to generate the 3x2 NumPy array.
  * The `columns` argument identifies the names of the two columns.

In [0]:
# Create a 3x2 NumPy array.
data = np.array([[10, 15], [13, 14], [16, 18]])

# Create a Python list that holds the names of the two columns.
cols = ['age', 'tooth']

# Create the DataFrame.
df = pd.DataFrame(data=data, columns=cols)

# Display the entire DataFrame (in a script, use print(df) instead).
df

   age  tooth
0   10     15
1   13     14
2   16     18
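
As a quick check on the layout described above, the `shape` and `size` attributes report the row and column counts and the total number of cells. This is a minimal sketch that rebuilds the same DataFrame under the hypothetical name `check_df` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the 3x2 DataFrame from the cell above (check_df is an illustrative name).
check_df = pd.DataFrame(
    data=np.array([[10, 15], [13, 14], [16, 18]]),
    columns=['age', 'tooth'],
)

n_rows, n_cols = check_df.shape  # (rows, columns)
n_cells = check_df.size          # rows * columns

print(n_rows, n_cols, n_cells)   # 3 2 6
```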


## Adding a new column to a DataFrame

You can add a new column to an existing pandas DataFrame by assigning values to a new column name.

In [0]:
# Create a new column named expected_teeth.
# np.where evaluates the condition row by row: add 2 where age >= 10,
# otherwise keep the tooth count unchanged.
df['expected_teeth'] = np.where(df['age'] >= 10, df['tooth'] + 2, df['tooth'])

df

   age  tooth  expected_teeth
0   10     15              17
1   13     14              16
2   16     18              20
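
Every age in the example above is at least 10, so only one branch of the condition is ever taken. The sketch below uses a hypothetical `sample` DataFrame (its values are made up for illustration) with ages on both sides of the threshold, and applies the condition row by row with `np.where` rather than a Python loop.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one age below the threshold of 10, two at or above it.
sample = pd.DataFrame({'age': [8, 10, 16], 'tooth': [20, 15, 18]})

# For each row, np.where picks the second argument where the condition
# is True and the third argument where it is False.
sample['expected_teeth'] = np.where(
    sample['age'] >= 10, sample['tooth'] + 2, sample['tooth']
)

print(sample['expected_teeth'].tolist())  # [20, 17, 20]
```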


## Specifying a subset of a DataFrame

Pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a DataFrame.

In [0]:
print("Rows #0, #1, and #2:")
print(df.head(3), '\n')

print("Row #2:")
print(df.iloc[[2]], '\n')

print("Rows #1 and #2:")
print(df[1:3], '\n')  # the slice end (3) is excluded

print("Column 'age':")
print(df['age'], '\n')

Rows #0, #1, and #2:
   age  tooth  expected_teeth
0   10     15              17
1   13     14              16
2   16     18              20 

Row #2:
   age  tooth  expected_teeth
2   16     18              20 

Rows #1 and #2:
   age  tooth  expected_teeth
1   13     14              16
2   16     18              20 

Column 'age':
0    10
1    13
2    16
Name: age, dtype: int64 
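
The cell above selects rows and a column, but not a single cell or a row and column combination. As a rough sketch (rebuilding the two original columns under the hypothetical name `subset_df` so that the block runs on its own), `iloc` selects by integer position while `loc` and `at` select by label.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the original two-column DataFrame.
subset_df = pd.DataFrame(
    data=np.array([[10, 15], [13, 14], [16, 18]]),
    columns=['age', 'tooth'],
)

# Single cell by position: row 1, column 0.
cell_by_position = subset_df.iloc[1, 0]

# Single cell by label: row label 1, column 'tooth'.
cell_by_label = subset_df.at[1, 'tooth']

# Rows 0 through 1 of the 'age' column only. Unlike positional slices,
# label slices with loc include the end label.
age_slice = subset_df.loc[0:1, ['age']]

print(cell_by_position)  # 13
print(cell_by_label)     # 14
print(age_slice)
```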



In [0]:
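# Create a 3x4 NumPy array of random integers in the range [0, 100).
data = np.random.randint(low=0, high=100, size=(3, 4))

# Name the four columns.
cols = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

df1 = pd.DataFrame(data=data, columns=cols)

# Print the entire DataFrame, then only the 'Eleanor' column.
print(df1)
print(df1['Eleanor'])

# Add a column named 'Janet' that sums 'Tahani' and 'Jason' row by row.
df1['Janet'] = df1['Tahani'] + df1['Jason']
print(df1)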

   Eleanor  Chidi  Tahani  Jason
0        5     86      72     63
1       12     31      90     30
2       21     33       9     40
0     5
1    12
2    21
Name: Eleanor, dtype: int64
   Eleanor  Chidi  Tahani  Jason  Janet
0        5     86      72     63    135
1       12     31      90     30    120
2       21     33       9     40     49


## Copying a DataFrame (optional)

Pandas provides two different ways to duplicate a DataFrame:

* **Referencing.** If you assign a DataFrame to a new variable, both names refer to the same object, so any change made through one name is reflected in the other.
* **Copying.** If you call the `pd.DataFrame.copy` method, you create a true independent copy. Changes to the original DataFrame or to the copy are not reflected in the other.

The difference is subtle, but important.

In [0]:
# Create a reference by assigning df1 to a new variable.
print("Experiment with a reference:")
df_ref = df1

# Print the starting value of a particular cell.
print("  Starting value of df1: %d" % df1['Jason'][1])
print("  Starting value of df_ref: %d\n" % df_ref['Jason'][1])

# Modify a cell in df1.
df1.at[1, 'Jason'] = df1['Jason'][1] + 15
print("  Updated df1: %d" % df1['Jason'][1])
print("  Updated df_ref: %d\n\n" % df_ref['Jason'][1])

# Create a true copy of df.
print("Experiment with a true copy:")
copy_of_my_dataframe = df.copy()

# Print the starting value of a particular cell.
print("  Starting value of df: %d" % df['tooth'][1])
print("  Starting value of copy_of_my_dataframe: %d\n" % copy_of_my_dataframe['tooth'][1])

# Modify a cell in df.
df.at[1, 'tooth'] = df['tooth'][1] + 3
print("  Updated df: %d" % df['tooth'][1])
print("  copy_of_my_dataframe does not get updated: %d" % copy_of_my_dataframe['tooth'][1])

Experiment with a reference:
  Starting value of df1: 95
  Starting value of df_ref: 95

  Updated df1: 110
  Updated df_ref: 110


Experiment with a true copy:
  Starting value of df: 14
  Starting value of copy_of_my_dataframe: 14

  Updated df: 17
  copy_of_my_dataframe does not get updated: 14
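
One way to tell the two cases apart before modifying any data is the `is` operator, which checks whether two names refer to the same object. A minimal sketch, using a hypothetical `original` DataFrame whose names and values are illustrative only:

```python
import pandas as pd

# Hypothetical DataFrame for illustration.
original = pd.DataFrame({'age': [10, 13, 16], 'tooth': [15, 14, 18]})

reference = original          # a second name for the same object
true_copy = original.copy()   # an independent copy (deep by default)

print(reference is original)  # True
print(true_copy is original)  # False

# A change made through the reference is visible in the original,
# while the copy keeps the old value.
reference.at[1, 'tooth'] = 17
print(original.at[1, 'tooth'])   # 17
print(true_copy.at[1, 'tooth'])  # 14
```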
