# DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns.  Each column can have a different type.  A DataFrame has both a row index and a column index and is analogous to a dict of Series that share the same index.  Row and column operations are treated roughly symmetrically.  Depending on the pandas version and its settings, a column returned by indexing a DataFrame may be a view of the underlying data rather than a copy.  To obtain an independent copy, use the Series' copy method.
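The copy behaviour described above can be sketched as follows. This is a minimal illustration with a made-up two-row frame (`df` and `pop_copy` are names chosen for this example); it only shows the safe path, an explicit copy, because what happens without the copy varies between pandas versions.

```python
import pandas as pd

# Small illustrative frame (not the notebook's data).
df = pd.DataFrame({'state': ['VA', 'MD'], 'pop': [5.0, 4.0]})

# Take an explicit copy of the column so later edits cannot reach the frame.
pop_copy = df['pop'].copy()
pop_copy.iloc[0] = 99.0

print(pop_copy.iloc[0])   # the copy was modified
print(df['pop'].iloc[0])  # the original frame is unchanged
```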

In [3]:
import pandas as pd

## Create a DataFrame (df)

### 1) Create a DataFrame from a dictionary

In [4]:
data1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
         'year' : [2012, 2013, 2014, 2014, 2015],
         'pop' : [5.0, 5.1, 5.2, 4.0, 4.1] }

df1 = pd.DataFrame(data1)
df1

Unnamed: 0,state,year,pop
0,VA,2012,5.0
1,VA,2013,5.1
2,VA,2014,5.2
3,MD,2014,4.0
4,MD,2015,4.1


Create a dataframe specifying a sequence of columns.

In [5]:
df2 = pd.DataFrame(data1, columns=['year', 'state'])
df2

Unnamed: 0,year,state
0,2012,VA
1,2013,VA
2,2014,VA
3,2014,MD
4,2015,MD


As with Series, columns that are not present in the data are filled with NaN.

In [6]:
df3 = pd.DataFrame(data1, columns=['year', 'state', 'pop', 'unempl']) 
df3

Unnamed: 0,year,state,pop,unempl
0,2012,VA,5.0,
1,2013,VA,5.1,
2,2014,VA,5.2,
3,2014,MD,4.0,
4,2015,MD,4.1,


### 2) Create a DataFrame from a nested dict of dicts

The keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified.

In [7]:
ndd = {'VA' : {2013 : 5.1, 2014 : 5.2},    # ndd = nested dict of dicts
       'MD' : {2014 : 4.0, 2015 : 4.1}}

df4 = pd.DataFrame(ndd)
df4

Unnamed: 0,VA,MD
2013,5.1,
2014,5.2,4.0
2015,,4.1
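To illustrate the "explicit index" case mentioned above, here is a small sketch that passes `index=` together with the same nested dict. The name `df_idx` and the extra label 2016 are assumptions made for this example only.

```python
import pandas as pd

ndd = {'VA': {2013: 5.1, 2014: 5.2},
       'MD': {2014: 4.0, 2015: 4.1}}

# Only the labels listed in `index` are kept, in the given order.
# A label missing from an inner dict (or from all of them) becomes NaN.
df_idx = pd.DataFrame(ndd, index=[2014, 2015, 2016])
print(df_idx)
```

Here 2013 is dropped because it is not in the requested index, and 2016 appears as a row of NaN.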


### 3) Create a DataFrame from a dictionary of Series

In [8]:
df4['VA'][1:]

2014    5.2
2015    NaN
Name: VA, dtype: float64

In [9]:
data2 = {'VA' : df4['VA'][1:],
        'MD' : df4['MD'][1:]}

df5 = pd.DataFrame(data2)
df5

Unnamed: 0,VA,MD
2014,5.2,4.0
2015,,4.1
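When the Series in the dict do not share the same index, pandas aligns them on the union of their labels. The sketch below uses two made-up Series (`a`, `b`) with partly overlapping string labels to show this.

```python
import pandas as pd

# Two Series whose indexes only partly overlap.
a = pd.Series([1.0, 2.0], index=['x', 'y'])
b = pd.Series([3.0, 4.0], index=['y', 'z'])

# The resulting index is the union of both; gaps are filled with NaN.
aligned = pd.DataFrame({'a': a, 'b': b})
print(aligned)
```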


## Initial Operations

### Get information

In [10]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   year    5 non-null      int64  
 1   state   5 non-null      object 
 2   pop     5 non-null      float64
 3   unempl  0 non-null      object 
dtypes: float64(1), int64(1), object(2)
memory usage: 288.0+ bytes


Transpose the DataFrame (similar to numpy):

In [11]:
df3.T.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, year to unempl
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3 non-null      object
 1   1       3 non-null      object
 2   2       3 non-null      object
 3   3       3 non-null      object
 4   4       3 non-null      object
dtypes: object(5)
memory usage: 364.0+ bytes
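The transposed frame above reports every column as `object`. A short sketch, using a small stand-in frame rather than `df3`, suggests why: after transposing, each new column mixes values taken from columns of different types.

```python
import pandas as pd

# Stand-in frame with an integer, a float and a string column.
df = pd.DataFrame({'year': [2012, 2013],
                   'pop': [5.0, 5.1],
                   'state': ['VA', 'VA']})

print(df.dtypes)    # one dtype per original column
print(df.T.dtypes)  # every transposed column holds mixed values
print(df.T.shape)   # rows and columns are swapped
```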


### Set index, column names

In [12]:
df5.index.name = 'year'
df5

year,VA,MD
2014,5.2,4.0
2015,,4.1


In [13]:
df5.columns.name = 'state'
df5

state,VA,MD
year,,
2014,5.2,4.0
2015,,4.1


### Set column's type

<div class="alert alert-info">
    <strong>Tip:</strong> Set the correct data type for each column, because the operations available on a column depend on its type.
</div>

In [14]:
df3['year'] = df3['year'].astype(str)
df3['year']

0    2012
1    2013
2    2014
3    2014
4    2015
Name: year, dtype: object
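As a rough illustration of why the type matters, the sketch below applies `+` to the same values stored as strings and as integers. The Series `years` is a stand-in built for this example.

```python
import pandas as pd

years = pd.Series(['2012', '2013', '2014'], name='year')

# On strings, "+" concatenates text.
as_text = years + '1'

# After converting to integers, "+" performs arithmetic.
as_int = years.astype(int) + 1

print(as_text.tolist())
print(as_int.tolist())
```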

### Create an individual column

Assign to a column name that does not exist yet to create a new column:

In [15]:
df3['new_year'] = df3['year']
df3

Unnamed: 0,year,state,pop,unempl,new_year
0,2012,VA,5.0,,2012
1,2013,VA,5.1,,2013
2,2014,VA,5.2,,2014
3,2014,MD,4.0,,2014
4,2015,MD,4.1,,2015
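Besides copying an existing column, a new column can be assigned from a scalar, a computed expression, or a list. The sketch below uses a small stand-in frame, and the column names `country`, `is_va` and `order` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'state': ['VA', 'VA', 'MD'], 'pop': [5.0, 5.1, 4.0]})

df['country'] = 'US'               # a scalar is repeated for every row
df['is_va'] = df['state'] == 'VA'  # derived from an existing column
df['order'] = [1, 2, 3]            # a list must match the number of rows

print(df)
```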


## Getting index/values

### Get column(s)

Retrieve a column by key, returning a Series.

In [16]:
df3['year']

0    2012
1    2013
2    2014
3    2014
4    2015
Name: year, dtype: object

Retrieve a column by attribute

In [17]:
df3.year

0    2012
1    2013
2    2014
3    2014
4    2015
Name: year, dtype: object

<strong>Question:</strong> What type of structure is the last result?<br>
<strong>Ans:</strong> Series

In [18]:
type(df3['year'])

pandas.core.series.Series
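Attribute access is convenient, but it only reaches a column when the name does not collide with an existing DataFrame attribute or method. A column named `pop`, as in the data used here, is one such case, because `DataFrame.pop` is a method. The sketch below uses a small stand-in frame to show the difference.

```python
import pandas as pd

df = pd.DataFrame({'year': [2012, 2013], 'pop': [5.0, 5.1]})

print(type(df.year))     # attribute access reaches the 'year' column
print(callable(df.pop))  # df.pop is the DataFrame.pop method, not the column
print(type(df['pop']))   # key access always reaches the column
```

Key access with `df['name']` is therefore the more robust habit.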

In [19]:
pd.DataFrame(df3['year'])

Unnamed: 0,year
0,2012
1,2013
2,2014
3,2014
4,2015


Get multiple columns

In [20]:
df3[['year', 'state']]

Unnamed: 0,year,state
0,2012,VA
1,2013,VA
2,2014,VA
3,2014,MD
4,2015,MD


### Get row(s)

Retrieve a row, or a single value, by label with loc.

In [30]:
df3.loc[2]

year        2014
pop          5.2
unempl       NaN
new_year    2014
Name: 2, dtype: object

In [32]:
df3.loc[2, 'year']

'2014'

Retrieve a row by position

In [31]:
df3.iloc[0]    # iloc = integer location (position-based)

year        2012
pop          5.0
unempl       NaN
new_year    2012
Name: 0, dtype: object

In [22]:
type(df3.iloc[0])

pandas.core.series.Series

Retrieve multiple rows with slice.

In [37]:
df3.iloc[2:3]

Unnamed: 0,year,pop,unempl,new_year
2,2014,5.2,,2014


<div class="alert alert-info">
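    <strong>Tip:</strong> The main differences between <code>loc</code> and <code>iloc</code>:
    <ul>
        <li>
            <code>loc</code> selects by index label (number or string); <code>iloc</code> selects by integer position.</li>
        <li>
            The end of a slice is inclusive with <code>loc</code> and exclusive with <code>iloc</code>.
        </li>
        <li>
            With the default integer index (0, 1, 2, ...), labels and positions coincide, so single-row lookups with <code>loc</code> and <code>iloc</code> return the same row.</li>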
    </ul>
</div>
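The slicing difference in the tip is easiest to see with a non-numeric index. The sketch below uses a stand-in frame with string labels `a` to `d`, chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({'pop': [5.0, 5.1, 5.2, 4.0]}, index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'c' is included.
by_label = df.loc['a':'c']

# Position-based slice: the end position 2 is excluded.
by_position = df.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```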

Select a slice of rows from a specific column of a DataFrame:

In [39]:
df3.loc[0:2, 'pop']

0    5.0
1    5.1
2    5.2
Name: pop, dtype: float64

### Get data as 2D array

Return the data contained in a DataFrame as a 2D ndarray:

In [24]:
df5.values

array([[5.2, 4. ],
       [nan, 4.1]])

If the columns have different dtypes, the 2D ndarray's dtype will be chosen to accommodate all of the columns:

In [25]:
df3.values

array([['2012', 'VA', 5.0, nan, '2012'],
       ['2013', 'VA', 5.1, nan, '2013'],
       ['2014', 'VA', 5.2, nan, '2014'],
       ['2014', 'MD', 4.0, nan, '2014'],
       ['2015', 'MD', 4.1, nan, '2015']], dtype=object)
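One way to avoid the generic `object` dtype is to select only the numeric columns before converting. The sketch below uses a stand-in frame and `DataFrame.to_numpy()`, which returns the same kind of array as `.values`; the column `rate` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'year': ['2012', '2013'],
                   'pop': [5.0, 5.1],
                   'rate': [0.1, 0.2]})

mixed = df.to_numpy()                     # strings and floats together
numeric = df[['pop', 'rate']].to_numpy()  # floats only

print(mixed.dtype)
print(numeric.dtype)
```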

## Deleting index/values

### Deleting column(s)

In [26]:
if 'state' in df3.columns:
    del df3['state']
else:
    print("Column doesn't exist")
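As an alternative to `del`, which modifies the DataFrame in place, `DataFrame.drop` returns a new DataFrame and can skip missing columns when `errors='ignore'` is passed. The sketch below uses a stand-in frame; the column name `missing` is deliberately absent.

```python
import pandas as pd

df = pd.DataFrame({'state': ['VA', 'MD'], 'pop': [5.0, 4.0]})

# Returns a new frame; 'missing' is skipped instead of raising an error.
trimmed = df.drop(columns=['state', 'missing'], errors='ignore')

print(list(trimmed.columns))
print(list(df.columns))  # the original frame keeps all its columns
```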