# Data frames

## DataFrames in Pandas (Python Data Analysis Library)
*  A pandas DataFrame is a two-dimensional, labeled data structure
*  Different columns can hold different data types
*  Think of a DataFrame as a table in which each column holds values of one data type
*  DataFrames make it convenient to manipulate data and extract information from it


### Creating DataFrames
**Dictionaries in Python**: We will build DataFrames from Python dictionaries. Each *"Key"* becomes a column name, and the list stored under that key supplies the *"Values"* of that column.

In [None]:
import pandas as pd

In [None]:
pd.DataFrame({'First Name': ['James', 'Alexander', 'Ashley'], 'Last Name':['Cook', 'Bell', 'Mason'], 'Age': [38, 42, 23]})

In [None]:
my_dict = {'First Name': ['James', 'Alexander', 'Ashley'], 
           'Last Name':['Cook', 'Bell', 'Mason'], 
           'Age': [38, 42, 23]}

In [None]:
df = pd.DataFrame(my_dict)
df

Specify row indices externally, through the `index` argument

In [None]:
df = pd.DataFrame(my_dict, index=[1, 2, 3]) #numeric row indexing
df

In [None]:
type(df)

In [None]:
help(df)

Specify row indices internally, as the keys of nested dictionaries

In [None]:
my_dict = {'First Name': {1: 'James', 2: 'Alexander', 3: 'Ashley'}, #the keys 1, 2, 3 become the row indices of the dataframe
           'Last Name': {1: 'Cook', 2: 'Bell', 3: 'Mason'}, 
           'Age': {1: 38, 2: 42, 3: 23}} 
df = pd.DataFrame(my_dict)
df
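
The inner keys do not have to match across columns. A minimal sketch, assuming a hypothetical variant of the dictionary above (`partial_dict` is not part of the original notebook) in which one row label is missing from one column; pandas aligns on the labels and fills the gap with NaN:

```python
import pandas as pd

# Hypothetical variant: row label 1 is missing from 'Last Name'
partial_dict = {'First Name': {1: 'James', 2: 'Alexander', 3: 'Ashley'},
                'Last Name': {2: 'Bell', 3: 'Mason'},
                'Age': {1: 38, 2: 42, 3: 23}}

partial_df = pd.DataFrame(partial_dict)
partial_df  # 'Last Name' is NaN in the row labeled 1
```

Note that repeating a key inside one inner dictionary is not an error in Python: the later value silently overwrites the earlier one, which can produce the same kind of gap.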

All values in a DataFrame column share a single data type (dtype)

In [None]:
df['Age'].dtype

In [None]:
df.Age.dtype

In [None]:
df.dtypes
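
To see what "a single data type" means when the input values are mixed, here is a small sketch. The DataFrame `mixed` and its column names are illustrative assumptions, not part of the original notebook. pandas picks one dtype per column that can represent every value in it:

```python
import pandas as pd

mixed = pd.DataFrame({'ints': [1, 2, 3],               # all integers
                      'ints_and_float': [1, 2.5, 3],   # integers are upcast to float
                      'ints_and_str': [1, 'two', 3]})  # falls back to the generic object dtype

mixed.dtypes
```

The `object` dtype stores arbitrary Python objects, so numeric operations on such a column are usually slower and may fail.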

### List of Lists

In [None]:
myList = [['James','Cook',38],
          ['Alexander','Bell',42],
          ['Ashley','Mason',23]]
myList

In [None]:
df2 = pd.DataFrame(myList, columns = ['First Name', 'Last Name', 'Age'])
df2

In [None]:
df2.dtypes

Dataframe to dictionary

In [None]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]} 

df = pd.DataFrame(my_dict, index = ['a','b','c'])

df

In [None]:
df.to_dict() 
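
By default `to_dict()` returns a dictionary of dictionaries keyed by column and then by row index. The `orient` parameter changes that layout. A short sketch on a reduced, hypothetical DataFrame (`people` is an illustrative name):

```python
import pandas as pd

people = pd.DataFrame({'First Name': ['James', 'Alexander'], 'Age': [38, 42]},
                      index=['a', 'b'])

as_lists = people.to_dict(orient='list')        # {column: [values]}, row index is dropped
as_records = people.to_dict(orient='records')   # one dictionary per row, row index is dropped
```

The `'list'` layout matches the dictionary format used to create DataFrames at the start of this section.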

Dataframe to list

In [None]:
my_list = df.values.tolist() # list of lists, one inner list per row; the column names are lost
my_list
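
If the column names are needed alongside the rows, one possible approach (a sketch, not the only way) is to put them in front as a header row. The names `people` and `with_header` are illustrative:

```python
import pandas as pd

people = pd.DataFrame({'First Name': ['James', 'Alexander', 'Ashley'],
                       'Last Name': ['Cook', 'Bell', 'Mason'],
                       'Age': [38, 42, 23]})

# First inner list holds the column names, the remaining ones hold the rows
with_header = [people.columns.tolist()] + people.values.tolist()
```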

### DataFrame Indices

In [None]:
df

In [None]:
df = pd.DataFrame(
     my_dict, 
     index=['First Row', 'Second Row', 'Third Row'])#string row indexing
df

In [None]:
import numpy as np #required for np.array below
np_arr = np.array([1,2,3]) #a NumPy array holds a homogeneous data type
df = pd.DataFrame(my_dict, index = np_arr) #use the NumPy array as the index
df.index = ['One', 'Two', 'Three'] #replace the index of an already created dataframe
df

In [None]:
myIndices = list(df.index) #assign dataframe indices to a variable as a list
myIndices

.set_index()

In [None]:
help(df.set_index)

In [None]:
df_new = df.set_index('Age') # returns a new dataframe indexed by the "Age" column; df is unchanged
df.set_index('Age', inplace = True) # modifies the existing dataframe in place
df # Updated
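
After `set_index('Age')`, "Age" is no longer a regular column. A minimal sketch of how to undo this with `reset_index()`, using a reduced, hypothetical DataFrame (`people`, `by_age` and `restored` are illustrative names):

```python
import pandas as pd

people = pd.DataFrame({'First Name': ['James', 'Alexander', 'Ashley'],
                       'Age': [38, 42, 23]})

by_age = people.set_index('Age')    # 'Age' becomes the index
restored = by_age.reset_index()     # 'Age' is a column again; the index is 0, 1, 2
```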

### DataFrame Columns

In [None]:
myColumns = list(df.columns) #assign dataframe columns to a variable as a list
myColumns

In [None]:
df = pd.DataFrame(my_dict, index = [1, 2, 3])
df.columns = ['First_Name', 'Last_Name', 'Age'] 
df

In [None]:
df_new = df[['First_Name']] #returns a new dataframe with a single column from df
df_new

In [None]:
df_new = df[['First_Name','Age']]
df_new

In [None]:
type(df_new)

In [None]:
df = df[['Age', 'First_Name', 'Last_Name']] #re-arranging the columns 
df

In [None]:
ps = df['First_Name']
ps

In [None]:
type(ps)

Some Useful Column Functions

In [None]:
df.Age.mean() 

In [None]:
df['Age'].mean()

In [None]:
np.mean(df['Age'])

In [None]:
df.First_Name.unique()
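
A few more column (Series) methods work the same way. The sketch below uses hypothetical data in which one age is repeated so that the counts are informative:

```python
import pandas as pd

# Hypothetical data: the age 42 appears twice
people = pd.DataFrame({'First_Name': ['James', 'Alexander', 'Ashley', 'Ann'],
                       'Age': [38, 42, 23, 42]})

oldest = people['Age'].max()                # largest value in the column
distinct_ages = people['Age'].nunique()     # number of distinct values
age_counts = people['Age'].value_counts()   # how often each value occurs
summary = people['Age'].describe()          # count, mean, std, min, quartiles, max
```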

Loading Specific Columns

In [None]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]} 
df = pd.DataFrame(my_dict) #loads all columns from the dictionary
df1 = pd.DataFrame(my_dict, columns = ["First Name", "Last Name"])
df1 #only the listed columns are loaded (saves memory)


Deleting Columns

In [None]:
del df1["First Name"] #removes a column from existing dataframe
df1

Drop function: deleting rows or columns

In [None]:
help(df.drop)

In [None]:
new_df = df.drop(labels='Age', axis=1) #drops column “Age” and creates a new dataframe. Existing df stays intact. 
df.drop(labels ='Age', axis = 1, inplace = True) #drops column “Age” and updates existing df. default inplace=False
df

In [None]:
new_df = df.drop(0) #drops the row labeled 0 (here the first row) and returns a new dataframe; df is unchanged
new_df 

In [None]:
df.drop(1, inplace = True) #deletes row labeled 1 from df.
df

Dropping multiple columns or rows

In [None]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]} 
df = pd.DataFrame(my_dict)
new_df = df.drop(labels = ['First Name','Age'], axis = 1) #drops multiple columns
new_df

In [None]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]} 
df = pd.DataFrame(my_dict)
new_df = df.drop(labels = [1,2], axis = 0) #drops the rows labeled 1 and 2
new_df

In [None]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]} 
df = pd.DataFrame(my_dict)
new_df = df.drop([1,2]) #drops the rows labeled 1 and 2 (axis=0 is the default)
new_df
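
`drop` also accepts the keyword arguments `columns` and `index`, which state the intent without an `axis` value. A short sketch using the same data (`without_cols` and `without_rows` are illustrative names):

```python
import pandas as pd

people = pd.DataFrame({'First Name': ['James', 'Alexander', 'Ashley'],
                       'Last Name': ['Cook', 'Bell', 'Mason'],
                       'Age': [38, 42, 23]})

without_cols = people.drop(columns=['First Name', 'Age'])  # same as labels=[...], axis=1
without_rows = people.drop(index=[1, 2])                   # same as labels=[1, 2], axis=0
```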

### Dimensions

In [None]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]} 

df = pd.DataFrame(my_dict, index=[1, 2, 3])

del df['Age'] #df now has 3 rows and 2 columns

In [None]:
dim = df.shape #shape is a tuple of the form (number of rows, number of columns)
dim

In [None]:
nrows = dim[0]
nrows

In [None]:
ncols = dim[1]
ncols
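
The same numbers can be obtained a little more directly. A minimal sketch on a hypothetical 3-by-2 DataFrame like the one above:

```python
import pandas as pd

people = pd.DataFrame({'First Name': ['James', 'Alexander', 'Ashley'],
                       'Last Name': ['Cook', 'Bell', 'Mason']},
                      index=[1, 2, 3])

nrows, ncols = people.shape   # unpack the tuple in one step
row_count = len(people)       # len() of a DataFrame is its number of rows
cell_count = people.size      # total number of cells: rows * columns
```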

### Data Selection

In [None]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
            'Last Name': ['Cook','Bell','Mason'],
             'Age': [38, 42, 23]} 
df = pd.DataFrame(my_dict, index=[1,2,3])

In [None]:
df[0:2] #select first two rows and all columns and create a new dataframe

In [None]:
df[0:2][['Age']] #select first two rows and column named “Age”, create a new data frame

In [None]:
df[1:2][['Age','Last Name']] #select second row and columns “Age” and “Last Name”

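**Note:** Slicing such as `df[0:2]` selects a contiguous block of rows. To pick an arbitrary set of rows, such as the first and the third, use `.loc` or `.iloc` with a list of labels or positions.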

**.loc** Selection by label. Rows and columns are specified by their labels in the DataFrame's index and columns, not by their positions.

In [None]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
            'Last Name': ['Cook','Bell','Mason'],
             'Age': [38, 42, 23]} 
df = pd.DataFrame(my_dict, index=[1,2,3])

In [None]:
df.loc[[1,3]] #returns the rows labeled 1 and 3, with all columns

In [None]:
df.loc[[1,3]][['Age']] #returns two rows and columns named “Age”
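
`.loc` can also take the row labels and the column labels in a single call, which avoids the chained `[...][...]` form. A short sketch on the same data (`people`, `ages` and `first_age` are illustrative names):

```python
import pandas as pd

people = pd.DataFrame({'First Name': ['James', 'Alexander', 'Ashley'],
                       'Last Name': ['Cook', 'Bell', 'Mason'],
                       'Age': [38, 42, 23]},
                      index=[1, 2, 3])

ages = people.loc[[1, 3], ['Age']]   # rows labeled 1 and 3, column 'Age', as a DataFrame
first_age = people.loc[1, 'Age']     # a single label for each axis returns a single value
```

The single-call form is also the one to prefer when assigning values, because chained indexing may operate on a temporary copy.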

In [None]:
df.loc[1:3] #label-based slice: selects the rows labeled 1 through 3, and the end label is included

In [None]:
df.loc[1:] # select all rows starting from row label 1 

In [None]:
new_df = pd.DataFrame(my_dict, index = ['a','b','c'])
new_df

In [None]:
new_df.loc[1:3] #raises a TypeError: integers are not valid slice labels for a string index

In [None]:
new_df.loc['a':'c'] #return rows from a to c. 

**.iloc** Selection by position

In [None]:
new_df.loc[1:3] #raises a TypeError: integers are not valid slice labels for a string index

In [None]:
new_df.iloc[1:3] #select from second row to third row

In [None]:
new_df.iloc[[0,2]] #returns first row and third row

In [None]:
new_df.iloc[[0,2]] [['Age']] #returns first row and third row and column label “Age”

In [None]:
new_df.iloc[[0, 2], [2]] #returns first row and third row and third column 

In [None]:
new_df.iloc[[0, 2], [0,2]] #returns more columns
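
One difference between the two accessors is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice excludes its stop position, like ordinary Python slicing. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

people = pd.DataFrame({'First Name': ['James', 'Alexander', 'Ashley'],
                       'Age': [38, 42, 23]},
                      index=['a', 'b', 'c'])

by_label = people.loc['a':'b']     # rows 'a' and 'b': the end label is included
by_position = people.iloc[0:1]     # row at position 0 only: the stop position is excluded
```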

### Online Data Sets

In [None]:
import seaborn as sns
iris = sns.load_dataset('iris') 
type(iris)
iris.head()

In [None]:
import pandas as pd
nba = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv") 
type(nba)
nba.head()

In [None]:
import pandas as pd
mtcars = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
mtcars.rename(columns={'Unnamed: 0':'brand'}, inplace=True)
mtcars.head()

### Missing Values

In [3]:
import numpy as np

In [None]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 3, 5],
                   [np.nan, 3, np.nan, 4]],
                   columns=list('ABCD'))
df

In [None]:
df.fillna(0)
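
Before filling, it often helps to see where the missing values are. A short sketch on the same data, counting NaN values per column and showing `dropna` as an alternative to filling (`missing_per_column`, `complete_rows` and `complete_cols` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 3, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

missing_per_column = df.isna().sum()   # number of NaN values in each column
complete_rows = df.dropna()            # keeps only rows without any NaN
complete_cols = df.dropna(axis=1)      # keeps only columns without any NaN
```

In this data every row contains at least one NaN, so `complete_rows` is empty, which is one reason filling can be preferable to dropping.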

Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with the mean of the respective column.

In [None]:
values = {"A": np.mean(df.A), "B": np.mean(df.B), "C": np.mean(df.C), "D": np.mean(df.D)}
df.fillna(value = values)

Only replace the first NaN element.
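
A minimal sketch of this step, assuming the `limit` parameter of `fillna` is what was intended here: with `limit=1`, at most one NaN per column is filled, starting from the top. The column means are computed with `df.mean()`, which skips NaN values by default (a substitution for the per-column `np.mean` calls above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 3, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

values = df.mean().to_dict()                 # column means, NaN values are skipped
filled = df.fillna(value=values, limit=1)    # fill only the first NaN in each column
filled
```

Columns 'A' and 'C' still contain NaN afterwards, because each of them had three missing values and only the first one was replaced.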