Pandas DataFrames are two-dimensional data structures with labeled rows and columns that can hold many data types.

**Create a DataFrame Manually**

To create a DataFrame manually from a dictionary of Pandas Series:
1. Create the dictionary of Pandas Series.
2. Pass the dictionary to the pd.DataFrame() function.


**Example 1: Create a DataFrame using a dictionary of Series**

In [1]:
import pandas as pd

items = {'Alice': pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants']),
          'Bob': pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch'])}

print(type(items))        

<class 'dict'>


In [2]:
shopping_carts = pd.DataFrame(items)
shopping_carts

Unnamed: 0,Alice,Bob
bike,500.0,245.0
book,40.0,
glasses,110.0,
pants,45.0,25.0
watch,,55.0


**Example 2: DataFrame assigns the numerical row indexes by default**

In [3]:
data = {'Alice': pd.Series(data = [40, 110, 500, 45]),
          'Bob': pd.Series(data = [245, 25, 55])}

df = pd.DataFrame(data)

df

Unnamed: 0,Alice,Bob
0,40,245.0
1,110,25.0
2,500,55.0
3,45,


In [4]:
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

shopping_carts has shape: (5, 2)
shopping_carts has dimension: 2
shopping_carts has a total of: 10 elements

The data in shopping_carts is:
 [[500. 245.]
 [ 40.  nan]
 [110.  nan]
 [ 45.  25.]
 [ nan  55.]]

The row index in shopping_carts is: Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: Index(['Alice', 'Bob'], dtype='object')


In [5]:
bob_shopping_cart = pd.DataFrame(items, columns = ['Bob'])
bob_shopping_cart

Unnamed: 0,Bob
bike,245
pants,25
watch,55


**Selecting Specific Rows of a DataFrame**

In [6]:
selected_items = pd.DataFrame(items, index = ['pants', 'book'])
selected_items

Unnamed: 0,Alice,Bob
pants,45,25.0
book,40,


**Selecting Specific Columns of a DataFrame**

In [7]:
alice_selected_items = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])
alice_selected_items

Unnamed: 0,Alice
glasses,110
bike,500


You can also manually create DataFrames from a dictionary of lists (arrays). The procedure is the same as before: we create the dictionary and then pass it to the pd.DataFrame() function. In this case, however, all the lists (arrays) in the dictionary must be of the same length.
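The same-length requirement can be checked with a small sketch. The dictionary `bad_data` below is illustrative and not part of the examples in this section; it shows the error raised for lists of unequal length, and that wrapping the lists in Series (as in the earlier examples) pads the shorter column with NaN instead.

```python
import pandas as pd

# Illustrative data: 'Integers' is one element shorter than 'Floats'.
bad_data = {'Floats': [4.5, 8.2, 9.6],
            'Integers': [1, 2]}

try:
    pd.DataFrame(bad_data)
    raised = False
except ValueError as err:
    # Plain lists of different lengths are rejected.
    raised = True
    print('ValueError:', err)

# A dictionary of Series has no such restriction: missing entries become NaN.
ok_df = pd.DataFrame({key: pd.Series(values) for key, values in bad_data.items()})
print(ok_df)
```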

**Create a DataFrame using a dictionary of lists**

In [8]:
data = {'Floats': [4.5, 8.2, 9.6],
        'Integers': [1, 2, 3]}

df = pd.DataFrame(data)

df

Unnamed: 0,Floats,Integers
0,4.5,1
1,8.2,2
2,9.6,3


**Create a DataFrame using a dictionary of lists, and custom row-indexes (labels)**

In [9]:
data = {'Floats': [4.5, 8.2, 9.6],
        'Integers': [1, 2, 3]}

df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

df

Unnamed: 0,Floats,Integers
label 1,4.5,1
label 2,8.2,2
label 3,9.6,3


**Access Elements Using Labels**

In [10]:
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])
store_items

Unnamed: 0,bikes,pants,watches,glasses
store 1,20,30,35,
store 2,15,5,10,50.0


In [11]:
print(store_items)

# We access rows, columns and elements using labels
print()
print('How many bikes are in each store:\n', store_items[['bikes']])
print()
print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])
print()
print('What items are in Store 1:\n', store_items.loc[['store 1']])
print()
print('How many bikes are in Store 2:', store_items['bikes']['store 2'])

         bikes  pants  watches  glasses
store 1     20     30       35      NaN
store 2     15      5       10     50.0

How many bikes are in each store:
          bikes
store 1     20
store 2     15

How many bikes and pants are in each store:
          bikes  pants
store 1     20     30
store 2     15      5

What items are in Store 1:
          bikes  pants  watches  glasses
store 1     20     30       35      NaN

How many bikes are in Store 2: 15


It is important to know that when accessing individual elements in a DataFrame with plain square brackets, the column label must come first, i.e., in the form dataframe[column][row]. For example, when retrieving the number of **bikes** in **store 2**, we first used the column label bikes and then the row label store 2. If we provide the row label first, we will get an error.
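As a minimal sketch of the ordering rule, the snippet below rebuilds a small store_items DataFrame and compares three lookups. The variable names `bikes_chained`, `bikes_loc`, and `row_first_failed` are illustrative. It also shows the .loc accessor, which takes the row label first and the column label second.

```python
import pandas as pd

store_items = pd.DataFrame(
    [{'bikes': 20, 'pants': 30, 'watches': 35},
     {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}],
    index=['store 1', 'store 2'])

# Column label first, then row label.
bikes_chained = store_items['bikes']['store 2']

# .loc takes the row label first, then the column label.
bikes_loc = store_items.loc['store 2', 'bikes']

# Row label first with plain brackets fails: 'store 2' is not a column label.
try:
    store_items['store 2']['bikes']
    row_first_failed = False
except KeyError:
    row_first_failed = True

print(bikes_chained, bikes_loc, row_first_failed)
```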

**Add a column to an existing DataFrame**

In [12]:
store_items['shirts'] = [15, 2]

store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts
store 1,20,30,35,,15
store 2,15,5,10,50.0,2


In [13]:
store_items['suits'] = store_items['pants'] + store_items['shirts']

store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts,suits
store 1,20,30,35,,15,45
store 2,15,5,10,50.0,2,7


Suppose now that you opened a new store and need to add its stock to your DataFrame. We can do this by adding a new row to the **store_items** DataFrame. To add rows to our DataFrame, we first create a new DataFrame and then append it to the original DataFrame with pd.concat().

In [14]:
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]
new_store = pd.DataFrame(new_items, index = ['store 3'])

new_store

Unnamed: 0,bikes,pants,watches,glasses
store 3,20,30,35,4


**Append the row to the DataFrame**

In [15]:
store_items = pd.concat([store_items, new_store])

store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts,suits
store 1,20,30,35,,15.0,45.0
store 2,15,5,10,50.0,2.0,7.0
store 3,20,30,35,4.0,,


In [16]:
store_items['new watches'] = store_items['watches'][1:]
store_items

Unnamed: 0,bikes,pants,watches,glasses,shirts,suits,new watches
store 1,20,30,35,,15.0,45.0,
store 2,15,5,10,50.0,2.0,7.0,10.0
store 3,20,30,35,4.0,,,35.0


It is also possible to insert new columns into a DataFrame anywhere we want. The dataframe.insert(loc, label, data) method inserts a new column at location loc, with the given column label and the given data. Let's add a new column named shoes right before the shirts column. Since shirts has positional index 4, we will use this value as loc.
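Rather than counting columns by hand, the position can be looked up. The sketch below uses a small illustrative DataFrame `df` (not the one from this section) and the columns.get_loc() method to find the position of the column that the new one should precede.

```python
import pandas as pd

# Illustrative DataFrame with four columns.
df = pd.DataFrame({'bikes': [20, 15], 'pants': [30, 5],
                   'shirts': [15, 2], 'suits': [45, 7]},
                  index=['store 1', 'store 2'])

# Look up the position of an existing column.
loc = df.columns.get_loc('shirts')

# .insert() modifies df in place; 'shirts' and the columns after it shift right.
df.insert(loc, 'shoes', [8, 5])

print(list(df.columns))
```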

**Add new column at a specific location**

In [17]:
store_items.insert(4, 'shoes', [8, 5, 0])

store_items

Unnamed: 0,bikes,pants,watches,glasses,shoes,shirts,suits,new watches
store 1,20,30,35,,8,15.0,45.0,
store 2,15,5,10,50.0,5,2.0,7.0,10.0
store 3,20,30,35,4.0,0,,,35.0


Just as we can add rows and columns we can also delete them. To delete rows and columns from our DataFrame we will use the **.pop()** and **.drop()** methods. The **.pop()** method only allows us **to delete columns**, while the **.drop()** method can be used **to delete both rows and columns by use of the axis keyword**.
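A short sketch of the difference between the two methods, using an illustrative DataFrame `df`: .pop() changes the DataFrame itself and hands back the removed column, while .drop() returns a new DataFrame and leaves the original as it was.

```python
import pandas as pd

df = pd.DataFrame({'bikes': [20, 15], 'pants': [30, 5], 'watches': [35, 10]},
                  index=['store 1', 'store 2'])

# .pop() removes the column in place and returns it as a Series.
watches = df.pop('watches')

# .drop() returns a new DataFrame; axis=1 targets columns, axis=0 targets rows.
no_pants = df.drop(['pants'], axis=1)
only_store_2 = df.drop(['store 1'], axis=0)

print(list(df.columns))          # 'watches' is gone from df itself
print(list(no_pants.columns))    # df still has 'pants'; no_pants does not
print(list(only_store_2.index))  # df still has both rows; only_store_2 has one
```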

**Delete one column from a DataFrame**

In [18]:
store_items.pop('new watches')
store_items

Unnamed: 0,bikes,pants,watches,glasses,shoes,shirts,suits
store 1,20,30,35,,8,15.0,45.0
store 2,15,5,10,50.0,5,2.0,7.0
store 3,20,30,35,4.0,0,,


**Delete Multiple Columns from a DataFrame**

In [19]:
store_items = store_items.drop(['watches', 'shoes'], axis = 1)

store_items

Unnamed: 0,bikes,pants,glasses,shirts,suits
store 1,20,30,,15.0,45.0
store 2,15,5,50.0,2.0,7.0
store 3,20,30,4.0,,


**Delete rows from a DataFrame**

In [20]:
store_items = store_items.drop(['store 2', 'store 1'], axis = 0)

store_items

Unnamed: 0,bikes,pants,glasses,shirts,suits
store 3,20,30,4.0,,


Sometimes we might need to change the row and column labels. Let's change the bikes column label to hats using the **.rename()** method.

**Modify the Column Label**

In [21]:
# We change the column label bikes to hats
store_items = store_items.rename(columns = {'bikes': 'hats'})

# we display the modified DataFrame
store_items

Unnamed: 0,hats,pants,glasses,shirts,suits
store 3,20,30,4.0,,


**Modify the Row Label**

In [22]:
store_items = store_items.rename(index = {'store 3': 'last store'})

store_items

Unnamed: 0,hats,pants,glasses,shirts,suits
last store,20,30,4.0,,


As mentioned earlier, before we can begin training our learning algorithms with large datasets, we usually need to clean the data first. This means we need a method for detecting and correcting errors in our data. While any given dataset can have many types of bad data, such as outliers or incorrect values, the type we encounter most often is missing values. Pandas assigns **NaN** values to missing data.

**Example 1: Create a DataFrame**

In [23]:
items2 = [{'bikes':20, 'pants':30, 'watches':35, 'shirts':15, 'shoes': 8, 'suits': 45},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
          {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

store_items

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


We can clearly see that the DataFrame we created has 3 NaN values: one in store 1 and two in store 3. However, when we load very large datasets into a DataFrame, possibly with millions of items, the NaN values cannot easily be spotted by eye. For these cases, we can use a combination of methods to count the number of NaN values in our data. The following example combines the **.isnull()** and **.sum()** methods to count the number of NaN values in our DataFrame.

**Example 2a: Count the total NaN values**

In [31]:
x1 = store_items.isnull()
print("X1")
print(x1)

x2 = store_items.isnull().sum()
print("X2")
print(x2)

x3 = store_items.isnull().sum().sum()
print("X3")
print(x3)

X1
         bikes  pants  watches  shirts  shoes  suits  glasses
store 1  False  False    False   False  False  False     True
store 2  False  False    False   False  False  False    False
store 3  False  False    False    True  False   True    False
X2
bikes      0
pants      0
watches    0
shirts     1
shoes      0
suits      1
glasses    1
dtype: int64
X3
3


In Pandas, logical True values have numerical value 1 and logical False values have numerical value 0. Therefore, we can count the number of NaN values by counting the number of logical True values. To count the total number of logical True values, we use the .sum() method twice: the first .sum() returns a Pandas Series with the count of logical True values in each column, and the second .sum() adds those counts together. The first .sum() returns:

<table>
    <thead>
        <tr>
            <th>Item</th>
            <th>NaN Count</th>
        </tr>
    </thead>
    <tbody>
        <tr><td>bikes</td><td>0</td></tr>
        <tr><td>pants</td><td>0</td></tr>
        <tr><td>watches</td><td>0</td></tr>
        <tr><td>shirts</td><td>1</td></tr>
        <tr><td>shoes</td><td>0</td></tr>
        <tr><td>suits</td><td>1</td></tr>
        <tr><td>glasses</td><td>1</td></tr>
    </tbody>
</table>
<p><code>dtype: int64</code></p>
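The counting idea above can be sketched on a reduced, illustrative DataFrame `df` that keeps only the three columns with missing data. The sketch also uses the .count() method, which counts the non-NaN entries in each column and is not used elsewhere in this section.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'shirts': [15, 2, np.nan],
                   'suits': [45, 7, np.nan],
                   'glasses': [np.nan, 50, 4]},
                  index=['store 1', 'store 2', 'store 3'])

nan_per_column = df.isnull().sum()     # Series: one NaN count per column
nan_total = int(nan_per_column.sum())  # single number for the whole DataFrame
non_nan_per_column = df.count()        # non-NaN entries per column

print(nan_per_column)
print('Total NaN values:', nan_total)
print(non_nan_per_column)
```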


**Eliminating NaN Values**

Now that we know how to check whether our dataset has any NaN values, the next step is to decide what to do with them. In general, we have two options: we can either delete or replace the **NaN** values. We will start by learning how to eliminate rows or columns from our DataFrame that contain any NaN values. The .dropna(axis) method eliminates any rows with NaN values when axis = 0 is used and any columns with NaN values when axis = 1 is used.

**Example 4. Drop rows having NaN values**

In [32]:
store_items.dropna(axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 2,15,5,10,2.0,5,7.0,50.0


**Example 5. Drop columns having NaN values**

In [33]:
store_items.dropna(axis = 1)

Unnamed: 0,bikes,pants,watches,shoes
store 1,20,30,35,8
store 2,15,5,10,5
store 3,20,30,35,10


Notice that the **.dropna()** method eliminates (drops) the rows or columns with **NaN** values out of place. This means that the original DataFrame is not modified. You can always remove the desired rows or columns in place by setting the keyword **inplace = True** inside the **.dropna()** method.
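A minimal sketch of the out-of-place versus in-place behavior, using an illustrative two-column DataFrame `df`:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'bikes': [20, 15, 20], 'glasses': [np.nan, 50, 4]},
                  index=['store 1', 'store 2', 'store 3'])

# Out of place: a new DataFrame is returned and df keeps all three rows.
dropped = df.dropna(axis=0)
shape_before = df.shape
print(shape_before, dropped.shape)

# In place: df itself loses the row with the NaN value.
df.dropna(axis=0, inplace=True)
print(df.shape)
```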

**Substituting NaN Values**

Now, instead of eliminating **NaN** values, we can replace them with suitable values. We could choose for example to replace all **NaN** values with the value 0. We can do this by using the **.fillna()** method.

**Example 6: Replace NaN with 0**

In [34]:
store_items.fillna(0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,0.0
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,0.0,10,0.0,4.0


We can also use the **.ffill()** method to replace **NaN** values with previous values in the DataFrame. This is known as **forward filling**. When replacing NaN values with forward filling, we can use previous values taken from columns or rows. The **.ffill(axis)** method replaces each NaN value with the previous known value along the given axis.

**Example 7: Forward fill NaN values down (axis = 0) the dataframe**

In [36]:
store_items.ffill(axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,2.0,10,7.0,4.0


Notice that the two NaN values in store 3 have been replaced with the previous values in their columns. The NaN value in store 1, however, was not replaced: it is the first value in its column, so there is no previous value to copy. If we instead forward fill along each row (axis = 1), this NaN value gets filled, because it has a preceding value in its row.

**Example 8: Forward fill NaN values across (axis = 1) the dataframe**

In [37]:
store_items.ffill(axis = 1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,35.0,10.0,10.0,4.0


We see that in this case every NaN value has been replaced with the previous value in its row.
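Forward filling along rows fills every NaN value here only because none of them sits in the first column. The sketch below uses an illustrative DataFrame `df` with a NaN value in the first column to show that such a value is left alone, just as the first value in a column is when filling down.

```python
import pandas as pd
import numpy as np

# Illustrative data: store 1 has a NaN in the first column.
df = pd.DataFrame({'bikes': [np.nan, 15], 'pants': [30, np.nan], 'watches': [35, 10]},
                  index=['store 1', 'store 2'])

filled = df.ffill(axis=1)

# store 2 'pants' is copied from 'bikes' in the same row;
# store 1 'bikes' has nothing to its left, so it stays NaN.
print(filled)
```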

Similarly, you can choose to replace the NaN values with the values that come after them in the DataFrame. This is known as **backward filling**. The **.bfill(axis)** method replaces each **NaN** value with the next known value along the given axis. Just like with forward filling, we can choose to use row or column values.

**Example 9. Backward fill NaN values *down* (axis=0) the dataframe**

In [38]:
store_items.bfill(axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,50.0
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


Notice that the **NaN** value in **store 1** has been replaced with the next value in its column. The two **NaN** values in **store 3**, however, were not replaced: they are the last values in their columns, so there are no next values to copy. If we instead backward fill along each row (axis = 1), the two **store 3** values get filled, but a **NaN** value in the last column, such as glasses in **store 1**, has no next value in its row and stays **NaN**.

**Example 10. Backward fill NaN values *across* (axis=1) the dataframe**

In [41]:
store_items.bfill(axis = 1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,10.0,10.0,4.0,4.0


Notice that the **.fillna()**, **.ffill()**, and **.bfill()** methods **replace (fill)** NaN values **out of place** by default. This means that the original DataFrame **remains unchanged** unless explicitly modified. You can always replace the NaN values in place by setting the parameter **inplace=True** within these methods.

We can also choose to replace NaN values by using different interpolation methods. For example, the **.interpolate(method='linear', axis)** method will use linear interpolation to replace NaN values using the values along the given axis.

**Example 11. Interpolate (estimate) NaN values down (axis=0) the dataframe**

In [42]:
store_items.interpolate(method = "linear", axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,2.0,10,7.0,4.0


Notice that the two NaN values in store 3 have been replaced. Because they are the last values in their columns, there is no later value to interpolate toward, and the filled values (2.0 and 7.0) equal the last known values in those columns. The NaN value in store 1 was not replaced: it is the first value in its column, and since there is no data before it, the interpolation function can't calculate a value.
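Since none of the NaN values in store_items lies between two known values in its column, the example above does not show an interpolated midpoint. The sketch below uses an illustrative single-column DataFrame `df` with a leading, an interior, and a trailing NaN value to show all three cases with the default settings of .interpolate().

```python
import pandas as pd
import numpy as np

# Illustrative column: NaN at the start, in the middle, and at the end.
df = pd.DataFrame({'shirts': [np.nan, 10.0, np.nan, 20.0, np.nan]},
                  index=['store 1', 'store 2', 'store 3', 'store 4', 'store 5'])

result = df.interpolate(method='linear', axis=0)

# store 1: no earlier value, stays NaN
# store 3: halfway between 10.0 and 20.0
# store 5: no later value, repeats the last known value
print(result)
```

With equally spaced rows, the interior gap is filled with 15.0, the midpoint of its two neighbors.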

**Example 12. Interpolate (estimate) NaN values across (axis=1) the dataframe**

In [43]:
store_items.interpolate(method = "linear", axis = 1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
store 2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
store 3,20.0,30.0,35.0,22.5,10.0,7.0,4.0


Just as with the other methods we saw, the .interpolate() method replaces NaN values out of place.