# Data Frame Introduction

- **DataFrame** is a 2-dimensional labeled data structure with labeled rows and labeled columns (variable names).


- Each column can be of a **different type (numeric, string, boolean, ...)**. 


- A DataFrame represents a tabular, spreadsheet-like structure containing an ordered collection of columns. 


- You can think of it as an Excel **spreadsheet**, an **SQL table**, or a **dict of Series objects**.


- It is similar to the __data.frame__ object in the R language.

- It is generally the most commonly used pandas object.
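
The "dict of Series" view can be sketched in a few lines. The names and values below are made up for illustration; the point is only that each column keeps its own type.

```python
import pandas as pd

# Hypothetical example data: three Series sharing the same row labels
names = pd.Series(["Ana", "Ben", "Cleo"], index=["r1", "r2", "r3"])
ages = pd.Series([31, 25, 40], index=["r1", "r2", "r3"])
member = pd.Series([True, False, True], index=["r1", "r2", "r3"])

# A dict of Series becomes a DataFrame; each key is a column name
people = pd.DataFrame({"name": names, "age": ages, "member": member})

print(people.dtypes)  # one dtype per column
```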

### Setting up the workspace

In [None]:
import pandas as pd
import numpy as np
from random import sample, choices, seed
from numpy.random import randn

#### DataFrame Example

In [None]:
# Creating index values
ind = ["R" + "_" + str(num) for num in range(1, 16)]

# Creating column names (Variables)
cols = ["VAR" + "_" + str(num) for num in range(1, 7)]

# Generating Random Data
my_data = randn(90).reshape(15, 6)

In [None]:
df = pd.DataFrame(data = my_data, index = ind, columns= cols)
df.head()
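
Because the data is random, the values change on every run. If repeatable output is wanted, one option is a seeded NumPy generator. This is a sketch, not part of the original setup: the seed value 42 and the name `demo_df` are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42 is arbitrary) makes the random values repeatable
rng = np.random.default_rng(42)

row_labels = [f"R_{num}" for num in range(1, 16)]
col_labels = [f"VAR_{num}" for num in range(1, 7)]

demo_df = pd.DataFrame(rng.standard_normal((15, 6)),
                       index=row_labels, columns=col_labels)

print(demo_df.shape)   # (15, 6)
```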

---
# Selection (Slicing), Creation and Deletion

   - Selecting, creating and deleting (dropping) data are common tasks when working with data. Each is discussed separately in this tutorial.
   

## Selection (Slicing)

### Variable Selection

- **String (label) indexing**: Passing the variable name to the data frame object to retrieve a specific variable. The syntax is shown here.

```python
        DataFrame['column-name']
  ```
  
#### Note: 

- Selecting a single variable returns a **Series**, not a data frame. In this sense, a data frame is a set of Series objects.

In [None]:
# Select The First variable
var_1 = df['VAR_1']
var_1

In [None]:
# Examine the type 
type(var_1)

  Indeed, __var_1__ is a pandas Series object.

### Remark: 

  - As mentioned in the definition, a data frame can be seen as a set of Series combined together under a shared index.
  
  
  - If we have two series, we can construct a data frame. 
  
  
  - We do that in the next example

In [None]:
var_2 = df['VAR_2']

> We extracted two series, now we join them together to form a new data frame by providing a dict of series.

In [None]:
df2 = pd.DataFrame(data = {'VAR_1': var_1, 'VAR_2':var_2})

In [None]:
df2.head()

In [None]:
# Check the type of df2
type(df2)
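
Another way to combine Series into a data frame is `pd.concat` with `axis=1`. The Series below are stand-ins with made-up values, so this is only a sketch of the idea.

```python
import pandas as pd

# Hypothetical Series; the names become the column labels
s1 = pd.Series([1.0, 2.0, 3.0], index=["R_1", "R_2", "R_3"], name="VAR_1")
s2 = pd.Series([4.0, 5.0, 6.0], index=["R_1", "R_2", "R_3"], name="VAR_2")

# axis=1 places the Series side by side as columns, aligned on the index
combined = pd.concat([s1, s2], axis=1)
print(combined)
```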

### Accessing more than one variable

  - If we want to select two or more variables from a data frame, we use double square brackets `[[]]`, that is, we pass a list of column names.
  
```python

   df[['var1', 'var2', ...]]
```

In [None]:
df[['VAR_1', 'VAR_3']].head() 

I am chaining the __head()__ method to print only the first few observations

### Accessing the DataFrame Variable Using the Dot Operator

   - With a data frame, we can access a variable using the __dot operator__, just like accessing an attribute or a method. 
   
   
   - This might be confusing at first, and some Python programmers do not recommend using it. However, it is worth knowing.

In [None]:
df.VAR_1.head()
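
One reason for the caution is that dot access only works when the column name is a valid Python identifier and does not clash with an existing attribute or method. A small sketch with a made-up frame:

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another is called "count"
scores = pd.DataFrame({"exam score": [70, 85], "count": [1, 2]})

# Bracket indexing works for any column name
print(scores["exam score"])
print(scores["count"])

# scores.count is the DataFrame method, not the column
print(callable(scores.count))   # True

# scores.exam score would be a syntax error, so brackets are the only option
```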

### Rows Selection


  - **Row selection (slicing)** in a pandas data frame can be done in two ways.
  
#### 1. String or label-index:

  - **The _loc_ method**: a **label-index (a string index)** is passed to the __loc method__ in square brackets in order to select a row. Here is the syntax:
  
```python
  df.loc['string-index']
```

#### Selecting first row labelled as 'R_1'

In [None]:
first_row = df.loc['R_1']
first_row

##### Examine the type

In [None]:
type(first_row)

### Note:

   - Selecting one row also returns a pandas Series object. Thus, slicing one element __horizontally or vertically__ results in a __Series object__.

In [None]:
# Select another row (plain df['R_6'] would look for a column named 'R_6')
df.loc['R_6']

#### 2. Integer-index Selection

 - **The _iloc method_**: This is the second way of slicing a data frame, by passing the **integer-index (position)** to the **iloc method**. Positions start at zero. Here is the syntax:
 
```python
df.iloc[int-index]
```

##### Selecting the first row (position is zero)

In [None]:
df.iloc[0]

In [None]:
# Select another row (position 5 is the sixth row)
df.iloc[5]
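
The difference between labels and positions is easiest to see when the row labels are themselves integers. The frame below is a made-up example whose labels start at 5 rather than 0.

```python
import pandas as pd

# Hypothetical frame whose row labels are integers that do not start at 0
small = pd.DataFrame({"x": [10, 20, 30]}, index=[5, 6, 7])

by_label = small.loc[5]       # row whose label is 5 (the first row)
by_position = small.iloc[0]   # row at position 0 (also the first row)

print(by_label["x"], by_position["x"])
```

Here `small.loc[0]` would raise a `KeyError`, because no row carries the label 0.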

### Multiple Row Selection

  - Selecting more than one row (case, observation) is done using double square brackets [[...]]: we pass a list of labels to the __loc method__, or a list of integer positions to the __iloc method__. 
  
  
  - In other words, **pass a list to either method**
  
```python
# 1. loc method
df.loc[['Row1', 'Row2', '...']]

# 2. iloc method
df.iloc[[1, 2, 4, ...]]
```

#### Select the first and the third Row

In [None]:
df.loc[['R_1', 'R_3']].head(3)

> We can have the same result using iloc as follows

In [None]:
df.iloc[[0, 2]].head(3)

> **Pandas Slicing is AWESOME isn't it!** 👊👊👊
---

### Row-Column Selection (Extremely Important Section)

- After learning how to slice columns and rows, combining the two gives __row-column selection__. Here is the syntax:

```python

# 1. One value
df.loc['R_1', 'VAR_1']   # or df.iloc[0, 0]

# 2. One row, multiple columns: pass a row label and a list of column labels, separated by a comma
df.loc['R_1', ['VAR_1', 'VAR_3', ...]]

# 3. Multiple rows, one column
df.loc[['R_1', 'R_3', ...], 'VAR_1']

# 4. Multiple rows, multiple columns: pass two lists separated by a comma
df.loc[['R_1', 'R_3', ...], ['VAR_1', 'VAR_3', ...]]
```

> ##### 1. One-Value

In [None]:
df.loc['R_1', 'VAR_1'] 

In [None]:
# Or
df.iloc[0, 0]

> ##### 2. One row-Multiple columns

In [None]:
df.loc['R_1', ['VAR_1', 'VAR_3', 'VAR_6']] 

> ##### 3. Multiple rows- One column

In [None]:
df.loc[['R_1', 'R_3', 'R_5'], 'VAR_1'] 

> ##### 4. Multiple rows- Multiple columns

In [None]:
df.loc[['R_1', 'R_3'], ['VAR_1', 'VAR_3']]

In [None]:
df.loc[['R_1', 'R_3', 'R_5'], ['VAR_1', 'VAR_3', 'VAR_6']]
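
The same four patterns can be written with `iloc` by using integer positions instead of labels. The sketch below uses a small stand-in frame (`grid`, a name chosen here) so that the results are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same label style as the tutorial
grid = pd.DataFrame(np.arange(12).reshape(3, 4),
                    index=["R_1", "R_2", "R_3"],
                    columns=["VAR_1", "VAR_2", "VAR_3", "VAR_4"])

one_value = grid.iloc[0, 0]         # same cell as grid.loc['R_1', 'VAR_1']
one_row = grid.iloc[0, [0, 2]]      # Series: R_1 with VAR_1 and VAR_3
one_col = grid.iloc[[0, 2], 0]      # Series: VAR_1 for R_1 and R_3
block = grid.iloc[[0, 2], [0, 2]]   # DataFrame: 2 rows x 2 columns
```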

---
## Creation

### Adding New Variables 

   - Creating a new variable is straightforward. We just have to pass the new variable name as if it already exists, then give it new values. The syntax is as follows:
   
```python
df['new_var'] = new_values
```

In [None]:
df['NEW_VAR'] = df['VAR_1'] + df['VAR_2']
df.head()
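
Bracket assignment changes the data frame it is applied to. When the original should stay untouched, the `assign` method is an alternative that returns a new data frame. A minimal sketch with made-up values:

```python
import pandas as pd

base = pd.DataFrame({"VAR_1": [1.0, 2.0], "VAR_2": [0.5, 1.5]})

# assign returns a new DataFrame; base itself is not modified
with_sum = base.assign(NEW_VAR=base["VAR_1"] + base["VAR_2"])

print(list(base.columns))       # ['VAR_1', 'VAR_2']
print(list(with_sum.columns))   # ['VAR_1', 'VAR_2', 'NEW_VAR']
```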

### Adding Empty Variable

   - If we pass a column name that isn't contained in the data, a column of __NaN__ values will show up. 


In [None]:
df2 = pd.DataFrame(data = df,  index = ind,
        columns = ['VAR_1', 'VAR_2', 'VAR_3', 'VAR_4', 'VAR_5', 'VAR_6', 'EmptyVar'])
df2.head()

### Assigning values to Variables

   - Assigning new values to a variable is done just like with a dictionary. We pass a value or a list of values to the specified column. A list must have as many values as the data frame has rows. 

In [None]:
df2['EmptyVar'] = 19.8
df2.head()

> The same value appears on all rows.

#### We can also change values by indexing a specified column

In [None]:
df2.loc['R_1':'R_5',['EmptyVar']] = 23.2
df2.head(7)

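You may notice that I am using ':' between the row labels. This works because the index labels are in sequence: with __loc__, the slice selects every row from the first label through the last, both included.

- We can even do that with variables if they are in sequence. Nice and short!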
---

In [None]:
df2.loc['R_1':'R_5', 'VAR_1':'VAR_4'].head() 
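
A detail worth knowing about `:` slices is that `loc` includes the end label, while `iloc` follows the usual Python rule and stops before the end position. A short sketch on a made-up frame:

```python
import pandas as pd

letters = pd.DataFrame({"val": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

label_slice = letters.loc["a":"c"]   # includes the row labelled 'c'
position_slice = letters.iloc[0:2]   # stops before position 2

print(len(label_slice), len(position_slice))   # 3 2
```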

---
## Dropping

###  Dropping Variables

  - Dropping a variable can be done using the pandas data frame __drop method__.
  
  
  - **By default, the drop method works on the row index, not the columns, because the axis argument defaults to zero (axis = 0).**
  
  
  - **Dropping variables** is achieved by __setting the axis=1__ 
  
  
  - Pandas handles dropping variables carefully:
  
     - It protects us from deleting a variable accidentally (losing data is expensive). 
     
     - Therefore, there is an argument called __inplace__ that confirms whether we want to modify the original data frame.
  
          - Setting __inplace = False__ (the default) leaves the original data untouched and returns a new data frame without the variable. 
     
          - Setting __inplace = True__ drops the variable from the original data frame itself.  

> #### Example: Dropping a variable with inplace = False, then inplace = True

In [None]:
# Set axis = 1 to drop a column; inplace = False returns a new data frame
df.drop('NEW_VAR', axis = 1, inplace = False).head()

> Check the original data frame

In [None]:
df.head()

 - Indeed, the variable is still there. If we want to delete the variable permanently, we should set inplace to True.

In [None]:
df.drop('NEW_VAR', axis = 1, inplace = True)

In [None]:
df.head()

> The variable has been permanently removed.

### Dropping Multiple Variables

  - This is done by providing a list of variables to the drop method, together with axis = 1.
  
```python

  df.drop(['V_1', 'V_2', ...], axis = 1)
```


In [None]:
df.drop(['VAR_1', 'VAR_3', 'VAR_6'], axis = 1).head(3)
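
As an alternative to `axis = 1`, `drop` also accepts a `columns` keyword, and reassigning the result avoids `inplace` altogether. The frame below is a made-up stand-in, so treat this as a sketch.

```python
import pandas as pd

frame = pd.DataFrame({"VAR_1": [1, 2], "VAR_2": [3, 4], "VAR_3": [5, 6]})

# columns= is equivalent to passing the labels with axis=1
trimmed = frame.drop(columns=["VAR_1", "VAR_3"])

# Reassigning the result is an alternative to inplace=True
frame = frame.drop(columns="VAR_2")
```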

### Dropping Rows (Observations)

  - The drop method drops rows from the data frame by default
  
  
  - Passing a row label drops one row. The drop method matches labels, so an integer only works when the index labels themselves are integers.
  
  
  - Passing a list of row labels will drop multiple rows
  
  
  - Setting __inplace = True__ will remove observations permanently. 

#### Dropping One Row

In [None]:
df.drop('R_1', axis = 0).head()

In [None]:
df.head()

 > The first row is still there because we didn't specify inplace = True.
 We can do that like this:
 
```python
df.drop('R_1', inplace = True)
```

#### Dropping Multiple Rows

  - Pass a list of row labels to the drop method.

```python

    df.drop(['R1', 'R2', ...])
```

In [None]:
df.drop(['R_1', 'R_3', 'R_5', 'R_14', 'R_15'])
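
Since `drop` works with labels, dropping rows by position needs one extra step: look up the labels at those positions first. A minimal sketch on a made-up frame:

```python
import pandas as pd

obs = pd.DataFrame({"val": [10, 20, 30, 40]},
                   index=["R_1", "R_2", "R_3", "R_4"])

# drop works with labels, so positions are first translated to labels
labels_to_drop = obs.index[[0, 2]]          # labels 'R_1' and 'R_3'
remaining = obs.drop(index=labels_to_drop)

print(list(remaining.index))   # ['R_2', 'R_4']
```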

---


# Understanding How Data Frames Are Constructed Internally

  - There are several ways to construct a DataFrame, such as:
      - Dictionaries (the most common way)
      - NumPy nd-arrays
      - Dataclasses (not discussed here)

## Constructing DataFrames from Dicts

In [None]:
data = {'Var1': [*range(1, 6)], 
       'Var2': [*range(2, 11, 2)], 
       'Var3': [*range(10, 51, 10)]}
data

In [None]:
dict_df = pd.DataFrame(data)
dict_df

#### Example Two: Constructing DataFrames from Dicts

In [None]:
dat = {'states': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 
      'year': [2000, 2001, 2002, 2001, 2002], 
      'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(dat)
df

#### Rearrange the columns names

  > By passing a list of variable names to the __columns__ argument, the columns will appear in exactly the order you pass.
  

In [None]:
pd.DataFrame(dat, columns = ['year', 'states', 'pop'])

#### Adding row label-index

  > passing a list of string labels to the __index__ argument will label the data frame rows.

In [None]:
pd.DataFrame(dat, columns = ['year', 'states', 'pop'],
            index = ['A', 'B', 'C', 'D', 'E'])

## Constructing DataFrames from Dict of Dicts

   - If a dict of dicts is passed to DataFrame, the __outer dict keys__ will be interpreted as __variables__ (columns), and the __inner dict keys__ as the row indices.

In [None]:
grades = {'Ahmad': {'Math': 13, 'Stats': 17.5, 'Science': 0}, 
         'Nabil': {'Math': 5.5, 'Computing': 17.75, 'Stats': 18.5}, 
         'Islam': {'Math': 7.5, 'Stats': 12.5, 'Algo': 18}}
grades

In [None]:
df_grades = pd.DataFrame(grades)
df_grades

> Where a value is missing, __NaN__ appears in the DataFrame. 

### Transposing the DataFrame

  - We can always swap rows and columns by using the __.T__ DataFrame attribute.

In [None]:
df_grades.T
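
If the outer keys should become rows from the start, `pd.DataFrame.from_dict` with `orient="index"` gives the same layout as transposing. The nested dict below is a shortened, made-up version of the grades data.

```python
import pandas as pd

# Hypothetical nested dict: outer keys are students, inner keys are subjects
marks = {"Ahmad": {"Math": 13, "Stats": 17.5},
         "Nabil": {"Math": 5.5, "Stats": 18.5}}

by_student = pd.DataFrame.from_dict(marks, orient="index")  # students as rows
transposed = pd.DataFrame(marks).T                          # same layout via .T

print(by_student)
```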

#### Index name, Columns names

  - Data about the data (metadata) helps us understand it. Thus, we can provide a name for the index and for the columns.

In [None]:
# Getting the index
df_grades.index

In [None]:
# Giving a name to the index
df_grades.index.name = 'Subjects'

In [None]:
df_grades.index.names

In [None]:
# Getting the column names
df_grades.columns

In [None]:
# Giving the columns a name
df_grades.columns.name = 'Students'


In [None]:
df_grades.columns.name

In [None]:
df_grades
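
Both names can also be set in one step with `rename_axis`, which returns a new data frame rather than changing the original. A small sketch with a made-up table:

```python
import pandas as pd

table = pd.DataFrame({"Ahmad": [13, 17.5], "Nabil": [5.5, 18.5]},
                     index=["Math", "Stats"])

# rename_axis returns a new frame with the axis names set
named = table.rename_axis(index="Subjects", columns="Students")

print(named.index.name, named.columns.name)   # Subjects Students
```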

## Constructing DataFrames from 2D-arrays



In [None]:
# Stack two 1-D arrays as rows, then transpose to get a 5 x 2 array
d = np.array([np.arange(5), np.arange(5, 26, 5)]).T
d

In [None]:
pd.DataFrame(d, index = [*range(1, 6)],  columns = ['First', 'Second'])

#### Note: 

  - There are other possible data inputs for constructing DataFrames. Consult the online documentation or a dedicated book. 
  
[__Python for Data Analysis__](https://www.oreilly.com/library/view/python-for-data/9781491957653/) by Wes McKinney is a good book on this matter. 
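
As one example of another input type, a list of dicts can be passed to `pd.DataFrame`, with each dict treated as one row. The records below are made up for illustration.

```python
import pandas as pd

# Hypothetical records: each dict is one row, keys become column names
records = [{"state": "Ohio", "year": 2000, "pop": 1.5},
           {"state": "Nevada", "year": 2001}]   # 'pop' is missing here

from_records = pd.DataFrame(records)
print(from_records)
```

The key that is missing from the second record shows up as __NaN__ in that row.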