# DATA FRAME

### Introduction 
___

In R, a DataFrame is a fundamental data structure used for storing tabular data. It is similar to a table in a relational database or a spreadsheet in other software. A DataFrame organizes data into rows and columns, where each column can contain different types of data (numeric, character, factor, etc.). DataFrames are widely used in data manipulation, exploration, and analysis tasks in R.

Here are some key characteristics of DataFrames in R:

1. Tabular Structure: DataFrames are two-dimensional structures with rows and columns. Rows typically represent individual observations or cases, while columns represent variables or attributes. Each column should contain the same number of data items.

2. Heterogeneous Data Types: Unlike matrices, which can only contain elements of a single data type, DataFrames can accommodate columns with different data types. For example, one column can be numeric, another character, and a third a factor.

3. Column Names: Each column in a DataFrame has a name that identifies the variable it represents. These names can be used to access and manipulate columns. The column names should be non-empty.

4. Row Indexing: Each row in a DataFrame has an index or row label that identifies the observation. The row index can be numeric or character-based. The row names should be unique.

5. Built-in Functions: DataFrames support a wide range of built-in functions for data manipulation, including subsetting, filtering, merging, and summarizing data.
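
The characteristics listed above can be checked directly in code. The following is a minimal sketch; the data frame `people` and its columns are made up for illustration and are not part of the original material.

```R
# Hypothetical example data: one column per data type
people <- data.frame(
  name   = c("Ann", "Ben", "Cy"),         # character column
  age    = c(31, 45, 28),                 # numeric column
  smoker = factor(c("no", "yes", "no")),  # factor column
  stringsAsFactors = FALSE
)

dim(people)            # number of rows and columns
sapply(people, class)  # class of each column
names(people)          # column names
rownames(people)       # row labels (numbered automatically here)
```

Each column keeps its own type, while every column has the same number of entries.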


In R, you can create a DataFrame using the `data.frame()` function. The syntax for creating a DataFrame in R is as follows:

```R
# `...` stands for any further column definitions
info <- data.frame(name = c("a", "b"), age = c(23, 24), ...,
                   row.names = NULL, check.rows = FALSE, check.names = TRUE,
                   stringsAsFactors = default.stringsAsFactors())
```

Many of R's data input functions, such as `read.table()`, `read.csv()`, `read.delim()` and `read.fwf()`, also read data into a data frame.

The call above defines a data frame in R using the `data.frame()` function. Let's break down each part:

____

`data.frame( ... )`

This is the function used to create a data frame in R. A data frame is a tabular structure that can hold different data types (characters, numbers, etc.) in columns.

```R
name = c("a", "b"), age = c(23, 24)
```
These are the arguments passed to the `data.frame()` function. They define the columns of the data frame:
- `name`: This creates a column named "name" containing character values "a" and "b".
- `age`: This creates a column named "age" containing numeric values 23 and 24.

The ellipsis (...) indicates that there could be additional arguments provided to define more columns in the data frame. You can follow the same format `(column_name = c(values))` for each additional column.

_____

`row.names = NULL`

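This argument specifies the row names of the data frame.  

* If the argument is not supplied, row names are taken from the first input that carries suitable names (for example a named vector); if there is none, rows are numbered automatically (1, 2, ...).

* Setting `row.names = NULL` explicitly always gives the automatic row numbers. It does not produce a data frame without row names.

* You can also pass a character or integer vector of row names, or a single column name or number to use that column as the row names.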
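
A small sketch of how `row.names` behaves may help here. The data frames `scores` and `plain` and their columns are hypothetical names chosen for illustration.

```R
# Use the (hypothetical) "id" column as the row names
scores <- data.frame(
  id    = c("a", "b"),
  score = c(10, 20),
  row.names = "id"
)
rownames(scores)  # row names come from the former "id" column
names(scores)     # "id" is no longer a regular column

# Without row.names and without named inputs, rows are numbered automatically
plain <- data.frame(score = c(10, 20))
rownames(plain)
```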

____

`check.rows = FALSE`

This argument controls whether R checks the rows of the inputs for consistency of length and names. 

* By default `(check.rows = FALSE)`, no such check is made.

* Setting `check.rows = TRUE` makes `data.frame()` verify that the row names of the inputs agree, which can catch mismatched inputs at the cost of some extra work.

____

`check.names = TRUE`

This argument controls whether R checks for valid variable names.  
* By default `(check.names = TRUE)`, R makes sure variable names (column names) are syntactically valid and not duplicated. Names that are not, such as ones containing spaces or starting with a digit, are adjusted with `make.names()` rather than rejected.

* Setting `check.names = FALSE` keeps the names exactly as given. It is advisable to leave the check enabled, because non-syntactic names must be quoted or wrapped in backticks whenever they are used.
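
To make the effect of `check.names` concrete, here is a minimal sketch. The column names `my col` and `2nd` are hypothetical and chosen only because they are not syntactically valid R names.

```R
# One name contains a space, the other starts with a digit
fixed <- data.frame(`my col` = 1:2, `2nd` = 3:4, check.names = TRUE)
names(fixed)  # adjusted by make.names()

asis <- data.frame(`my col` = 1:2, `2nd` = 3:4, check.names = FALSE)
names(asis)   # kept exactly as written

# Non-syntactic names need backticks when used with $
asis$`my col`
```
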
____

`stringsAsFactors = default.stringsAsFactors()`

This argument determines how character data in the data frame is treated.  

* In older R versions (prior to 4.0), the default behavior was to convert character columns to factors (a special data type in R for categorical data).

* In newer R versions (4.0 and later), the default behavior is to keep character columns as character vectors.

`default.stringsAsFactors()` returns the default in effect for your R session. Recent R releases deprecate this helper and write the default directly as `stringsAsFactors = FALSE`, so passing `TRUE` or `FALSE` explicitly is the most portable choice.
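
The difference between the two settings can be seen by passing the argument explicitly, which works the same way in old and new R versions. This is a small sketch with made-up data; `letters_chr` and `letters_fct` are hypothetical names.

```R
letters_chr <- data.frame(x = c("a", "b", "a"), stringsAsFactors = FALSE)
letters_fct <- data.frame(x = c("a", "b", "a"), stringsAsFactors = TRUE)

class(letters_chr$x)   # character vector
class(letters_fct$x)   # factor
levels(letters_fct$x)  # the distinct categories of the factor
```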

```R
my_data_frame <- data.frame(
  name = c("Alice", "Bob"),
  age = c(30, 25),
  row.names = c("A", "B")
)

print(my_data_frame)
```

```
   name age
A Alice  30
B   Bob  25
```


```R
# Create a DataFrame with three columns: "Name", "Age", "Gender"
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 28),
  Gender = c("Female", "Male", "Male"),
  row.names = c("101", "102", "103"),
  stringsAsFactors = FALSE  # Avoid converting strings to factors
)

# Display the DataFrame
print(df)
```

```
       Name Age Gender
101   Alice  25 Female
102     Bob  30   Male
103 Charlie  28   Male
```


We can check if a variable is a data frame or not using the `class()` function.

```R
# check the class of each object
print(class(df))
print(class(my_data_frame))
```

```
[1] "data.frame"
[1] "data.frame"
```
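
Besides `class()`, base R offers `is.data.frame()` and `inherits()` for the same check, which return a logical value and are convenient inside conditions. The sketch below uses hypothetical objects `marks` and `m`.

```R
marks <- data.frame(id = 1:3, score = c(90, 85, 70))  # hypothetical data
m <- as.matrix(marks)                                 # same values as a matrix

is.data.frame(marks)           # TRUE
is.data.frame(m)               # FALSE: a matrix is not a data frame
inherits(marks, "data.frame")  # TRUE
is.list(marks)                 # TRUE: a data frame is stored as a list of columns
```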


### R Data Frame Operations
___

In this section, we will perform various operations on a data frame in R. Let's go through them one by one:

1. Create Data Frame
___

```R
# creating a data frame with column names
employee_data <- data.frame(
  employee_id = c(1:5),
  employee_name = c("James", "Harry", "Shinji", "Jim", "Oliver"),
  sal = c(642.3, 535.2, 681.0, 739.0, 925.26),
  join_date = as.Date(
    c("2013-02-04",
      "2017-06-21",
      "2012-11-14",
      "2018-05-19",
      "2016-03-25")
  ),
  stringsAsFactors = FALSE
)

print(employee_data)
```

```
  employee_id employee_name    sal  join_date
1           1         James 642.30 2013-02-04
2           2         Harry 535.20 2017-06-21
3           3        Shinji 681.00 2012-11-14
4           4           Jim 739.00 2018-05-19
5           5        Oliver 925.26 2016-03-25
```


This R code creates a data frame named `employee_data` and then prints it.

A data frame is a table-like data structure available in R. It's similar to a spreadsheet or SQL table, or a dictionary of Series objects in pandas. Data frames are the most commonly used data structure in data analysis.

The `data.frame` function is used to create the data frame. Inside this function, four vectors are defined, each representing a column of the data frame:

- `employee_id` is a sequence of integers from 1 to 5.
- `employee_name` is a character vector containing five names.
- `sal` is a numeric vector containing five salary values.
- `join_date` is a vector of dates. The `as.Date` function is used to convert the character strings to date objects.

The `c` function is used to create each vector. This function combines its arguments into a vector.

The `stringsAsFactors = FALSE` argument prevents R from converting string vectors into factors (a data type used to store categorical data). Before R 4.0 this conversion was the default, so the argument was needed to turn it off. In R 4.0 and later it matches the default and is harmless to keep.

Finally, the `print` function is used to display the `employee_data` data frame. This will print the data frame to the console in a tabular format.

```R
# Setting up of our data. Note that all the vectors have the same length
temp <- c(20.37, 18.56, 18.4, 21.96, 29.53, 28.16,
          36.38, 36.62, 40.03, 27.59, 22.15, 19.85)
humidity <- c(88, 86, 81, 79, 80, 78,
              71, 69, 78, 82, 85, 83)
rain <- c(72, 33.9, 37.5, 36.6, 31.0, 16.6,
          1.2, 6.8, 36.8, 30.8, 38.5, 22.7)
month <- c("January", "February", "March", "April", "May", "June",
           "July", "August", "September", "October", "November", "December")
```

```R
# To create a data frame, we use the data.frame() function
data <- data.frame(
  month = month,
  temperature = temp,
  humidity = humidity,
  rain = rain
)

# Names of the variables (columns)
names(data)

# Display the first few rows of the data frame
head(data)
```

| | month | temperature | humidity | rain |
|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | January | 20.37 | 88 | 72.0 |
| 2 | February | 18.56 | 86 | 33.9 |
| 3 | March | 18.4 | 81 | 37.5 |
| 4 | April | 21.96 | 79 | 36.6 |
| 5 | May | 29.53 | 80 | 31.0 |
| 6 | June | 28.16 | 78 | 16.6 |


You can also use the `summary()` function, which returns a statistical summary of the variables (columns) of the dataset.

```R
# Summary statistics of the data frame
summary(data)
```

```
    month            temperature       humidity         rain      
 Length:12          Min.   :18.40   Min.   :69.0   Min.   : 1.20  
 Class :character   1st Qu.:20.24   1st Qu.:78.0   1st Qu.:21.18  
 Mode  :character   Median :24.87   Median :80.5   Median :32.45  
                    Mean   :26.63   Mean   :80.0   Mean   :30.37  
                    3rd Qu.:31.24   3rd Qu.:83.5   3rd Qu.:36.98  
                    Max.   :40.03   Max.   :88.0   Max.   :72.00  
```

2. Get the Structure of the R Data Frame
___
The structure of a data frame can be seen by using the `str()` function.

```R
# Display the structure of the DataFrame
str(employee_data)
```

```
'data.frame':	5 obs. of  4 variables:
 $ employee_id  : int  1 2 3 4 5
 $ employee_name: chr  "James" "Harry" "Shinji" "Jim" ...
 $ sal          : num  642 535 681 739 925
 $ join_date    : Date, format: "2013-02-04" "2017-06-21" ...
```
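
Alongside `str()`, a few other base R functions give a quick view of a data frame's size and contents. The sketch below uses a small hypothetical data frame `staff` so that it can be run on its own.

```R
staff <- data.frame(
  id   = 1:4,
  name = c("Ann", "Ben", "Cy", "Di"),
  stringsAsFactors = FALSE
)

nrow(staff)         # number of rows
ncol(staff)         # number of columns
dim(staff)          # both at once: rows, then columns
head(staff, n = 2)  # first two rows
tail(staff, n = 2)  # last two rows
```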


3. How to Access Components of a Data Frame?
____
We can use the `[`, `[[` or `$` operator to access the columns of a data frame.

```R
x <- data.frame(
  "SN" = 1:2,
  "Age" = c(21, 15),
  "Name" = c("John", "Dora"),
  stringsAsFactors = FALSE
)

# access the "Name" column using different methods
print(x["Name"])
print(x$Name)
print(x[["Name"]])
print(x[[3]])
```

```
  Name
1 John
2 Dora
[1] "John" "Dora"
[1] "John" "Dora"
[1] "John" "Dora"
```


After creating the data frame, the code demonstrates four different methods to access the "Name" column:

1. `x["Name"]`: This method returns a data frame that includes the "Name" column.

2. `x$Name`: This method returns a vector that contains the values of the "Name" column. The $ operator is used to access the variables in the data frame.

3. `x[["Name"]]`: This method also returns a vector that contains the values of the "Name" column. The double square brackets [[ ]] are used to access the elements of a list or a data frame.

4. `x[[3]]`: This method returns a vector that contains the values of the third column in the data frame, which is the "Name" column in this case.
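
The difference between the four methods is the type of object they return. A short sketch that checks this explicitly, reusing the same `x` as above (the `drop = FALSE` lines are an addition for illustration):

```R
x <- data.frame(
  SN = 1:2,
  Age = c(21, 15),
  Name = c("John", "Dora"),
  stringsAsFactors = FALSE
)

is.data.frame(x["Name"])        # TRUE: single brackets keep the data frame
is.character(x[["Name"]])       # TRUE: double brackets return the column itself
identical(x$Name, x[["Name"]])  # TRUE: $ and [[ give the same vector

# Matrix-style indexing drops a single column to a vector
# unless drop = FALSE is given
is.character(x[, "Name"])                 # TRUE
is.data.frame(x[, "Name", drop = FALSE])  # TRUE
```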


4. Extract data from Data Frame
___
Use the name of a column to extract that column from the data frame.

```R
# extract the employee_name and employee_id columns.
output1 <- data.frame(employee_data$employee_name, employee_data$employee_id)

print(output1)
```

```
  employee_data.employee_name employee_data.employee_id
1                       James                         1
2                       Harry                         2
3                      Shinji                         3
4                         Jim                         4
5                      Oliver                         5
```


```R
# Extract the first two rows of the DataFrame
output2 <- employee_data[1:2, ]
print(output2)
```

```
  employee_id employee_name   sal  join_date
1           1         James 642.3 2013-02-04
2           2         Harry 535.2 2017-06-21
```

```R
# Extract the first three rows of the DataFrame
output3 <- employee_data[1:3, ]
print(output3)
```

```
  employee_id employee_name   sal  join_date
1           1         James 642.3 2013-02-04
2           2         Harry 535.2 2017-06-21
3           3        Shinji 681.0 2012-11-14
```

```R
# Extract the first two columns of the DataFrame
output4 <- employee_data[, 1:2]
print(output4)
```

```
  employee_id employee_name
1           1         James
2           2         Harry
3           3        Shinji
4           4           Jim
5           5        Oliver
```

```R
# Extract the first three columns of the DataFrame
output5 <- employee_data[, 1:3]
print(output5)
```

```
  employee_id employee_name    sal
1           1         James 642.30
2           2         Harry 535.20
3           3        Shinji 681.00
4           4           Jim 739.00
5           5        Oliver 925.26
```


```R
# Extract the 1st and 2nd rows with the 2nd and 3rd columns
output6 <- employee_data[1:2, 2:3]
print(output6)
```

```
  employee_name   sal
1         James 642.3
2         Harry 535.2
```

```R
# Extract the 2nd and 3rd rows with the 1st and 3rd columns
output7 <- employee_data[2:3, c(1, 3)]
print(output7)
```

```
  employee_id   sal
2           2 535.2
3           3 681.0
```


5. Expand R Data Frame
____
A data frame can be expanded by adding columns and rows.


```R
# Add Column to DataFrame
employee_data$dept <- c("HR", "Finance", "IT", "IT", "Finance")
employee_data_new <- employee_data
print(employee_data_new)
```

```
  employee_id employee_name    sal  join_date    dept
1           1         James 642.30 2013-02-04      HR
2           2         Harry 535.20 2017-06-21 Finance
3           3        Shinji 681.00 2012-11-14      IT
4           4           Jim 739.00 2018-05-19      IT
5           5        Oliver 925.26 2016-03-25 Finance
```


```R
# add a new row to the DataFrame
new_row <- data.frame(
  employee_id = 6,
  employee_name = "John",
  sal = 600.0,
  join_date = as.Date("2014-09-15"),
  dept = "HR"
)

# Append the new row to the DataFrame
employee_data_new <- rbind(employee_data_new, new_row)

# Display the updated DataFrame
print(employee_data_new)
```

```
  employee_id employee_name    sal  join_date    dept
1           1         James 642.30 2013-02-04      HR
2           2         Harry 535.20 2017-06-21 Finance
3           3        Shinji 681.00 2012-11-14      IT
4           4           Jim 739.00 2018-05-19      IT
5           5        Oliver 925.26 2016-03-25 Finance
6           6          John 600.00 2014-09-15      HR
```


```R
# Create the second R data frame
employee_new_data <- data.frame(
  employee_id = c(6:8),
  employee_name = c("Aman", "Piyush", "Aakash"),
  sal = c(523.0, 721.3, 622.8),
  join_date = as.Date(c("2015-06-22", "2016-04-30", "2011-03-17")),
  dept = c("HR", "Finance", "IT"),
  stringsAsFactors = FALSE
)
```

```R
# Append the second data frame to the first data frame
update_employee_data <- rbind(employee_data_new, employee_new_data)

head(update_employee_data)
```

| | employee_id | employee_name | sal | join_date | dept |
|---|---|---|---|---|---|
| | `<dbl>` | `<chr>` | `<dbl>` | `<date>` | `<chr>` |
| 1 | 1 | James | 642.3 | 2013-02-04 | HR |
| 2 | 2 | Harry | 535.2 | 2017-06-21 | Finance |
| 3 | 3 | Shinji | 681.0 | 2012-11-14 | IT |
| 4 | 4 | Jim | 739.0 | 2018-05-19 | IT |
| 5 | 5 | Oliver | 925.26 | 2016-03-25 | Finance |
| 6 | 6 | John | 600.0 | 2014-09-15 | HR |


6. Remove Rows and Columns
___
Use negative indices, written with the `c()` function, to remove rows and columns from a data frame:

```R
# deleting first row and column from the data frame
data <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Remove the first row and column
data_update <- data[-c(1), -c(1)]

# Print the new data frame
data_update
```

| | Pulse | Duration |
|---|---|---|
| | `<dbl>` | `<dbl>` |
| 2 | 150 | 30 |
| 3 | 120 | 45 |
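
Negative indices remove by position. When positions are not known in advance, columns can be dropped by name and rows by condition instead. The following is a minimal sketch of that idea; the object names `workouts`, `no_pulse` and `long_only` are made up for illustration.

```R
workouts <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45),
  stringsAsFactors = FALSE
)

# Drop the "Pulse" column by name: keep every column except that one
no_pulse <- workouts[, setdiff(names(workouts), "Pulse")]
names(no_pulse)

# Drop rows by condition: keep only rows with Duration of at least 45
long_only <- workouts[workouts$Duration >= 45, ]
long_only
```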


```R
# Recreate the employee data frame with three rows
employee_data <- data.frame(
  employee_id = c(6:8),
  employee_name = c("Aman", "Piyush", "Aakash"),
  sal = c(523.0, 721.3, 622.8),
  join_date = as.Date(c("2015-06-22", "2016-04-30", "2011-03-17")),
  dept = c("HR", "Finance", "IT"),
  stringsAsFactors = FALSE
)

# Display the data frame
employee_data
```

| employee_id | employee_name | sal | join_date | dept |
|---|---|---|---|---|
| `<int>` | `<chr>` | `<dbl>` | `<date>` | `<chr>` |
| 6 | Aman | 523.0 | 2015-06-22 | HR |
| 7 | Piyush | 721.3 | 2016-04-30 | Finance |
| 8 | Aakash | 622.8 | 2011-03-17 | IT |


```R
# Delete the third row
employee_data_update <- employee_data[-c(3), ]

# calling the data frame
employee_data_update
```

| | employee_id | employee_name | sal | join_date | dept |
|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<dbl>` | `<date>` | `<chr>` |
| 1 | 6 | Aman | 523.0 | 2015-06-22 | HR |
| 2 | 7 | Piyush | 721.3 | 2016-04-30 | Finance |


### Accessing R DataFrame like a matrix
___

Data frames can be accessed like a matrix by providing indexes for row and column.

To illustrate this, we use datasets already available in R. Datasets that are available can be listed with the command `library(help = "datasets")` or `data()`

```R
data()
```

```
Data sets in package 'datasets':

AirPassengers           Monthly Airline Passenger Numbers 1949-1960
BJsales                 Sales Data with Leading Indicator
BJsales.lead (BJsales)
                        Sales Data with Leading Indicator
BOD                     Biochemical Oxygen Demand
CO2                     Carbon Dioxide Uptake in Grass Plants
ChickWeight             Weight versus age of chicks on different diets
DNase                   Elisa assay of DNase
EuStockMarkets          Daily Closing Prices of Major European Stock
                        Indices, 1991-1998
Formaldehyde            Determination of Formaldehyde
HairEyeColor            Hair and Eye Color of Statistics Students
Harman23.cor            Harman Example 2.3
Harman74.cor            Harman Example 7.4
Indometh                Pharmacokinetics of Indomethacin
InsectSprays            Effectiveness of Insect Sprays
JohnsonJohnson          Quarterly Earnings per Johnson & Johnson Share
LakeHuron               Level 
```

___

There are two main ways to access already available datasets in R:

1. Built-in Datasets:

R comes with a collection of built-in datasets that you can access directly using the `data()` function. These datasets cover various domains and can be a good starting point for learning data analysis techniques.

Here's how to access them:

* List available datasets: Type `data()` at the R prompt. This will display a list of all the built-in datasets available in your R installation.

* Load a specific dataset: Use the `data("dataset_name")` function, replacing `"dataset_name"` with the actual name of the dataset you want to load. For example, `data(mtcars)` will load the famous `"mtcars"` dataset containing car mileage information.

2. External Datasets:

Many datasets are available online from various sources like government agencies, research institutions, and public repositories. To access these datasets, you'll typically need to download them in a format compatible with R (e.g., CSV, Excel) and then use specific functions to read them into your R environment.

Here's a general approach for using external datasets:

* Download the dataset: Locate and download the dataset you're interested in from a reliable source.

* Identify the format: Check the downloaded file format (e.g., CSV, Excel (.xlsx), tab-delimited (.txt)).

* Use the appropriate read function: Depending on the format, you can use functions like:

  * `read.csv()`: For comma-separated values (CSV) files.

  * `read.delim()`, or `read.table()` with `sep = "\t"`: For tab-delimited text files.

  * `read_excel()` from the readxl package, or `read.xlsx()` from the openxlsx package: For Excel files (these packages must be installed separately).

* Load the data: Use the chosen function, specifying the file path and any necessary arguments such as the delimiter or header row information. For example, `my_data <- read.csv("my_data.csv")` would read the "my_data.csv" file.

By following these approaches, you can leverage both built-in and external datasets for your data analysis tasks in R.
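
The external-dataset workflow can be tried without downloading anything. The sketch below writes a tiny CSV to a temporary file and reads it back with `read.csv()`. The file contents, the column names `city` and `population`, and the values are invented purely for illustration.

```R
# Write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,population", "Oslo,700000", "Bergen,290000"), csv_path)

# Read it back: the first line is treated as the header row
cities <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(cities)

unlink(csv_path)  # remove the temporary file
```
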
____

We will use the trees dataset which contains `Girth`, `Height` and `Volume` for Black Cherry Trees.

A data frame can be examined using functions like `str()` and `head()`.

```R
# Build a data frame holding the first 10 rows of the built-in trees data
trees <- data.frame(
  Girth = c(8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11, 11, 11.1, 11.2),
  Height = c(70, 65, 63, 72, 81, 83, 66, 75, 80, 75),
  Volume = c(10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9)
)

# print the structure of trees
str(trees)

# display the first 10 rows of trees
head(trees, n = 10)
```

```
'data.frame':	10 obs. of  3 variables:
 $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2
 $ Height: num  70 65 63 72 81 83 66 75 80 75
 $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9
```

| | Girth | Height | Volume |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 8.3 | 70 | 10.3 |
| 2 | 8.6 | 65 | 10.3 |
| 3 | 8.8 | 63 | 10.2 |
| 4 | 10.5 | 72 | 16.4 |
| 5 | 10.7 | 81 | 18.8 |
| 6 | 10.8 | 83 | 19.7 |
| 7 | 11.0 | 66 | 15.6 |
| 8 | 11.0 | 75 | 18.2 |
| 9 | 11.1 | 80 | 22.6 |
| 10 | 11.2 | 75 | 19.9 |


We can see that `trees` is a data frame with 10 rows and 3 columns. Because `n = 10`, `head()` displays all 10 rows of the data frame.

Now we proceed to access the data frame like a matrix.

```R
trees <- data.frame(
  Girth = c(8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11, 11, 11.1, 11.2),
  Height = c(70, 65, 63, 72, 81, 83, 66, 75, 80, 75),
  Volume = c(10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9)
)

# select rows 2 and 3 of trees
trees[2:3, ]

# select rows with Height greater than 82
trees[trees$Height > 82, ]

# select the Height column of rows 8 to 10
# (this 10-row copy has no rows 11 and 12, so indexing them would give NA)
trees[8:10, "Height"]
```

| | Girth | Height | Volume |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 2 | 8.6 | 65 | 10.3 |
| 3 | 8.8 | 63 | 10.2 |

| | Girth | Height | Volume |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 6 | 10.8 | 83 | 19.7 |
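
Row conditions can also be combined with `&` (and) or `|` (or), and base R's `subset()` offers a shorter way to write the same filter. The sketch below uses the same ten rows under the hypothetical name `trees10`.

```R
trees10 <- data.frame(
  Girth = c(8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11, 11, 11.1, 11.2),
  Height = c(70, 65, 63, 72, 81, 83, 66, 75, 80, 75),
  Volume = c(10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9)
)

# Rows with Height above 75 AND Girth above 11
tall_wide <- trees10[trees10$Height > 75 & trees10$Girth > 11, ]

# The same filter with subset(), where column names are used directly
tall_wide2 <- subset(trees10, Height > 75 & Girth > 11)

tall_wide
```

Only row 9 (Girth 11.1, Height 80) satisfies both conditions.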


### How to modify a Data Frame in R?
___

Data frames can be modified through reassignment, in the same way as matrices.

```R
x <- data.frame(
  SN = c(1, 2),
  Age = c(21, 15),
  Name = c("John", "Dora")
)

# print the initial data frame
print(x)

# update the Age value in the first row to 20
x[1, "Age"] <- 20

# print the updated data frame
print(x)
```

```
  SN Age Name
1  1  21 John
2  2  15 Dora
  SN Age Name
1  1  20 John
2  2  15 Dora
```


```R
# Rows can be added to a data frame using the rbind() function.
x <- data.frame(
  SN = c(1, 2),
  Age = c(20, 15),
  Name = c("John", "Dora")
)

# print the initial data frame
print(x)

# create a new row and bind it to the data frame
new_row <- list(SN = 1, Age = 16, Name = "Paul")
x <- rbind(x, new_row)

# print the updated data frame
print(x)
```

```
  SN Age Name
1  1  20 John
2  2  15 Dora
  SN Age Name
1  1  20 John
2  2  15 Dora
3  1  16 Paul
```
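
One detail worth knowing when binding two data frames: `rbind()` matches their columns by name rather than by position, so the column order of the second data frame does not have to agree with the first. A minimal sketch with hypothetical data frames `first` and `second`:

```R
first  <- data.frame(SN = 1, Name = "John", stringsAsFactors = FALSE)
# Same columns, but listed in a different order
second <- data.frame(Name = "Dora", SN = 2, stringsAsFactors = FALSE)

both <- rbind(first, second)
both  # columns follow the order of the first data frame
```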


```R
# Similarly, we can add columns using cbind().
x <- data.frame(
  SN = c(1, 2),
  Age = c(20, 15),
  Name = c("John", "Dora")
)

# print the initial data frame
print(x)

# add a new column "State" to the data frame using cbind()
x <- cbind(x, State = c("NY", "FL"))

# print the updated data frame
print(x)
```

```
  SN Age Name
1  1  20 John
2  2  15 Dora
  SN Age Name State
1  1  20 John    NY
2  2  15 Dora    FL
```


### Deleting Components of a Data Frame
___

A data frame column can be deleted by assigning `NULL` to it.

```R
x <- data.frame(
  SN = c(1, 2),
  Age = c(20, 15),
  Name = c("John", "Dora"),
  State = c("NY", "FL")
)

# print the initial data frame
print(x)

# remove the "State" column from the data frame
x$State <- NULL

# print the updated data frame
print(x)
```

```
  SN Age Name State
1  1  20 John    NY
2  2  15 Dora    FL
  SN Age Name
1  1  20 John
2  2  15 Dora
```
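
Assigning `NULL` changes the data frame in place. If the original should stay untouched, one alternative (shown here as a sketch, not as the only approach) is `subset()` with a negative `select`, which returns a reduced copy. The name `kept` is hypothetical.

```R
x <- data.frame(
  SN = c(1, 2),
  Age = c(20, 15),
  Name = c("John", "Dora"),
  State = c("NY", "FL")
)

# Return a copy without the "Age" and "State" columns; x itself is unchanged
kept <- subset(x, select = -c(Age, State))

names(kept)
names(x)
```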


### Remove Rows and Columns

Use negative indices, written with the `c()` function, to remove rows and columns from a data frame:

```R
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Remove the first row and column
Data_Frame_New <- Data_Frame[-c(1), -c(1)]

# Print the new data frame
Data_Frame_New
```

| | Pulse | Duration |
|---|---|---|
| | `<dbl>` | `<dbl>` |
| 2 | 150 | 30 |
| 3 | 120 | 45 |
