# Intro

## Before You Start

### Requirements
[Jupyter](https://jupyter.org/install)
    
### Installing and Running
This code may be run in an R environment in Jupyter notebooks or any terminal that runs R. 

### Website
   https://maneuver-id.mit.edu/
### About
The U.S. Air Force, through the AI Accelerator, released a dataset from Pilot Training Next (PTN) of Air Force pilots and trainees flying in virtual reality simulators. To enable AI coaching and automatic maneuver grading in pilot training, the Air Force seeks to automatically identify and label each maneuver flown in this dataset from a catalog of around 30 maneuvers. Your solution helps advance the state of the art in flying training!
### Challenges 
This notebook addresses the second challenge: identifying maneuvers in unlabeled data based on sample maneuver data. It should be run along with the Preprocessing.Rmd notebook/file.

# Loading Libraries

The analysis in this portion of the lab requires the following packages.

In [1]:
# Load the tidyverse (readr, dplyr, stringr) used throughout this notebook
library(tidyverse)
# Fix the seed so the sorties sampled later are reproducible
set.seed(2023)
# Only has an effect when this code is knit as an R Markdown document
knitr::opts_chunk$set(echo = TRUE)


# Loading the Data Sets into R

The first step is to load the TSV files. We loop through the good and bad folders produced by the preprocessing steps: navigate to each directory, identify all the .tsv files, and then iteratively load them into the data frame, *df*.

Because of the large amount of data, the code can take some time to run. Therefore, the data frame *df* has been saved as an .RData file. The LoadedSorties.RData file is located in the Data directory and can be loaded instead of running this section's code. The data frame stored in LoadedSorties.RData also has the NA values already removed.
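
As a minimal sketch of that caching step, the example below saves a toy data frame with `save()` and restores it with `load()`. The toy `df` and the `cache_path` location are assumptions made so the example runs anywhere; in the project the file would be LoadedSorties.RData in the Data directory.

```r
# Toy stand-in for the loaded sorties; the real df is far larger.
df <- data.frame(
  timeSecond = c(0.0, 0.1, 0.2),
  sortieNum  = c("s1", "s1", "s2"),
  label      = factor(c("good", "good", "bad")),
  stringsAsFactors = FALSE
)

# Hypothetical cache location (a temporary file, not the project's Data directory).
cache_path <- file.path(tempdir(), "LoadedSorties.RData")

# save() stores the object together with its name ...
save(df, file = cache_path)

# ... so after removing it, load() brings `df` back under the same name.
rm(df)
restored_names <- load(cache_path)

print(restored_names)  # names of the objects that were restored
str(df)
```

Because `load()` restores objects under their original names, it silently overwrites any existing `df` in the workspace, which is worth keeping in mind when mixing cached and freshly loaded data.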

In [3]:
#Create an empty data frame to accumulate the contents of the TSV files
df<- data.frame()

#Function to load in data
load_tsv <- function(x){
  
  #load Data
  df_temp = read_delim(x, delim = "\t")
  
  #Determine sortie
  df_temp = mutate(df_temp, sortieNum = strsplit(basename(x), "[.]")[[1]][1])
  
  #determine good v bad from the file path (assumes good files sit under a folder named good_tsv)
  if(str_detect(x, "/good_tsv/")){
    lab = "good"
  }else{
    lab = "bad"
  }

  df_temp = mutate(df_temp, label = lab)
  
  #get rid of column number and rename columns
  df_temp = df_temp[,-1]
  colnames(df_temp) = c( "timeSecond", "xEastPos", "yNorthPos", "zUpPos", "xEastVel", "yNorthVel", "zUpVel", "headingDeg", "pitchDeg", "rollDeg","sortieNum", "label")
  
  df <<- rbind(df, df_temp)
}
#set wd
setwd("../../Pipeline/Step0_Raw/data/ObservedTrajectory/12000000000_tsv_good") #Change to directory with good tsv files

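#Current Directory must be Observed Trajectory Folder
#pattern is a regular expression, so the dots are escaped and the extension is anchored
files <- list.files(pattern = "\\.min\\.tsv$", full.names = TRUE, recursive = TRUE)
#invisible() suppresses printing the value returned by each call
invisible(lapply(files, load_tsv))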
#clean up df
df$label <- as.factor(df$label)




# Handling NA Values

In [4]:
# Next, handle NA values. Check how many NA values exist, in total and per column.

sum(is.na(df))
colSums(is.na(df))

#Although there are 78371 NA values, almost all come from missing velocities, and only 18 sorties contribute them.
na_rows <- which(is.na(df), arr.ind = TRUE)[,1]
na_sorties <- unique(df[na_rows,]$sortieNum)
length(na_sorties)

#These 18 sorties are removed from the data so that only complete data is analyzed.
 
df <- df[!(df$sortieNum %in% na_sorties),]
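
To make the effect of this filter concrete, here is a small sketch on made-up data (the `toy` data frame and its values are assumptions, not taken from the sorties). It shows that a single missing cell removes every row of the affected sortie, not just the row containing the NA.

```r
# Toy data: sortie "b" has one missing velocity, "a" and "c" are complete.
toy <- data.frame(
  sortieNum = c("a", "a", "b", "b", "c"),
  xEastVel  = c(1.0, 1.2, NA, 0.9, 1.1),
  zUpPos    = c(10, 11, 12, 13, 14),
  stringsAsFactors = FALSE
)

# Row index of every NA cell (one entry per missing cell).
na_rows_toy <- which(is.na(toy), arr.ind = TRUE)[, "row"]

# Sorties that own at least one of those rows.
na_sorties_toy <- unique(toy$sortieNum[na_rows_toy])

# Drop every row of an affected sortie, including its non-missing rows.
toy_clean <- toy[!(toy$sortieNum %in% na_sorties_toy), ]

print(na_sorties_toy)
print(toy_clean)
```

Dropping whole sorties keeps each remaining trajectory contiguous in time, at the cost of discarding some valid rows.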


# Creating Training, Testing, and Validation Sets

To ensure the model trains on the contents of each sortie rather than on how many observations each label has, the training and testing sets will be artificially balanced. The validation set will be unbalanced and reflect the true proportion of sorties.

In addition to balancing the data set, the limited number of bad data points requires care so that training and testing data do not end up in the validation set. To maximize the training data, 600 sorties from each group ("good" and "bad") were selected and then removed from the available pool. Then 75 sorties from each group were selected for the testing set. Finally, the two groups were combined and a final 150 sorties were sampled to create the validation set.

At this point, the data frame *df* has been saved as LoadedSorties.RDS in the Data directory. The next step is to explore the data to determine how best to create the training and testing sets, before identifying the sample of sorties used in the training set.


In [5]:
#Total Number of Sorties
length(unique(df$sortieNum))
#Number of "Good" Sorties
goodSorties <- unique(df$sortieNum[which(df$label == 'good')])
length(goodSorties)
#Number of "Bad" Sorties
badSorties <- unique(df$sortieNum[which(df$label == 'bad')])
length(badSorties)
#Proportion of total sorties that are "Good"
length(goodSorties)/length(unique(df$sortieNum))



In [None]:
#Training set: 600 bad and 600 good sorties (balanced)
training_sorties <- c(sample(badSorties, 600), sample(goodSorties, 600))

#Remove the training sorties from the available pool
remainingSortiesBad <- setdiff(badSorties, training_sorties)
remainingSortiesGood <- setdiff(goodSorties, training_sorties)

#Testing set: 75 bad and 75 good sorties (balanced)
testing_sorties <- c(sample(remainingSortiesBad, 75), sample(remainingSortiesGood, 75))

#Remove the testing sorties from the available pool
remainingSortiesBad <- setdiff(remainingSortiesBad, testing_sorties)
remainingSortiesGood <- setdiff(remainingSortiesGood, testing_sorties)

#Validation set: 150 sorties drawn from the combined remaining pool (unbalanced)
validation_sorties <- sample(c(remainingSortiesBad, remainingSortiesGood), 150)
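
A quick sanity check on a split like this is to confirm that no sortie lands in more than one set. The sketch below repeats the same three-stage draw on made-up sortie IDs with smaller sizes (the IDs, the sizes, and the `*_ids` names are all assumptions for illustration) and then checks that the sets are disjoint.

```r
set.seed(2023)

# Hypothetical sortie IDs; the real vectors come from df$sortieNum.
good_ids <- paste0("g", 1:40)
bad_ids  <- paste0("b", 1:20)

# Same three-stage draw as above, with smaller (assumed) sizes.
train_ids <- c(sample(bad_ids, 10), sample(good_ids, 10))

rest_bad  <- setdiff(bad_ids, train_ids)
rest_good <- setdiff(good_ids, train_ids)
test_ids  <- c(sample(rest_bad, 5), sample(rest_good, 5))

rest_all  <- setdiff(c(rest_bad, rest_good), test_ids)
valid_ids <- sample(rest_all, 8)

# No sortie may appear in more than one split.
stopifnot(
  length(intersect(train_ids, test_ids)) == 0,
  length(intersect(train_ids, valid_ids)) == 0,
  length(intersect(test_ids, valid_ids)) == 0
)

# Training and testing are balanced; validation follows the leftover pool.
print(table(substr(train_ids, 1, 1)))
print(table(substr(test_ids, 1, 1)))
print(table(substr(valid_ids, 1, 1)))
```

Note that the validation draw reflects the proportions of the pool left after the balanced draws, which can differ from the proportions in the full data set when one group is small.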







Now, the data from the main data frame will be extracted to create training, testing, and validation data frames.

In [1]:
training_df <- subset(df, sortieNum %in% training_sorties)
testing_df <- subset(df, sortieNum %in% testing_sorties)
validation_df <- subset(df, sortieNum %in% validation_sorties)



The training, testing, and validation sets are now ready and can be used to complete task 1. To avoid reloading the data and to keep the sampling identical between runs, these three sets have been saved in the data folder.
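
One way such a save step might look is sketched below with `saveRDS()` and `readRDS()`. The toy data frames, the temporary `out_dir` folder, and the file names are assumptions for illustration; the actual files in the data folder may be named and stored differently.

```r
# Toy stand-ins for the three data frames built above.
training_df   <- data.frame(sortieNum = c("s1", "s2"), label = c("good", "bad"),
                            stringsAsFactors = FALSE)
testing_df    <- data.frame(sortieNum = "s3", label = "good",
                            stringsAsFactors = FALSE)
validation_df <- data.frame(sortieNum = "s4", label = "bad",
                            stringsAsFactors = FALSE)

# Hypothetical output folder; a temporary directory keeps the sketch runnable.
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

splits <- list(training = training_df,
               testing = testing_df,
               validation = validation_df)

# Write one .rds file per split (file names are assumptions).
for (name in names(splits)) {
  saveRDS(splits[[name]], file.path(out_dir, paste0(name, "_df.rds")))
}

# Reading a split back returns the stored data frame.
training_reloaded <- readRDS(file.path(out_dir, "training_df.rds"))
print(identical(training_reloaded, training_df))
```

Unlike `load()`, `readRDS()` returns the object instead of restoring it under a fixed name, so each split can be assigned to whatever variable the downstream notebook expects.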