In [64]:
import os
import cleanup_utils as clu
import processing as pro


When analyzing the data, I approach it in two distinct phases based on the rows of the data frame. This strategy is adopted due to the markedly different types and patterns observed within the data. As a result, I work with two sets of assumptions: one for the 'Head', which includes rows 1 through 20, and another for the 'Core', which encompasses the remainder of the data. 
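The split described above can be sketched as follows. This is a minimal illustration, assuming the frame is already loaded and that "rows 1 through 20" maps to the first 20 positional rows; the helper name `split_head_core` and the demo labels are hypothetical and not part of `cleanup_utils` or `processing`.

```python
import pandas as pd

HEAD_ROWS = 20  # "rows 1 through 20" from the text above


def split_head_core(frame, head_rows=HEAD_ROWS):
    """Return (head, core): the first `head_rows` rows and the remainder."""
    head = frame.iloc[:head_rows]
    core = frame.iloc[head_rows:]
    return head, core


# Hypothetical demo frame: 20 setup rows followed by 5 data rows.
demo = pd.DataFrame({"Message": ["setup"] * 20 + ["record"] * 5})
head, core = split_head_core(demo)
```

Keeping the boundary in a single constant makes it easy to adjust if a device writes a longer or shorter setup block.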

<br>

## Head Assumptions
 
<br>

1. The first few rows of each file are dedicated to setup and profiling details, pertaining to the device's software and user configuration. These specifics are not pertinent to the analysis of running performance.

2. While the 'Message' column initially offers insights into the data type, it becomes irrelevant for further analysis once the data is classified.

3. The 'Type', 'Local Number', and 'Message' columns are generally considered redundant by default.


![First rows from each recording](data_head.png "First rows from each recording")


## Core Assumptions

<br>

1. A 'record' row includes relevant features for the project, such as time, distance, pace, heart rate, etc.
2. Rows labeled as 'unknown' or 'gps_metadata' appear to contain specific encoded data or initial GPS data, which is assumed to be encapsulated in the relevant features presented in the 'record' rows.
3. The 'unknown' rows may represent properties or device-specific information that would require further decoding; they can be ignored for performance analysis.
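One way to act on these assumptions is to keep only the 'record' rows. The sketch below assumes the row label ('record', 'unknown', 'gps_metadata') is stored in the 'Message' column; the helper name `keep_record_rows` and the 'Value 1' demo column are hypothetical.

```python
import pandas as pd


def keep_record_rows(frame, label_col="Message"):
    """Keep rows labeled 'record' and renumber the index from zero."""
    mask = frame[label_col] == "record"
    return frame[mask].reset_index(drop=True)


# Hypothetical demo frame mixing the three row labels named above.
demo = pd.DataFrame({
    "Message": ["record", "unknown", "gps_metadata", "record"],
    "Value 1": [1, 2, 3, 4],
})
records = keep_record_rows(demo)
```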

<br>



![First rows from each recording](body_pattern.png)


# Inner process of run_clean (pipeline main function) 
____

In [65]:
# Function Inputs 
data_folder_path = 'Data/Before Processing'
processed_folder_path = 'Data/After Processing'

# Relevant Paths List
files_names = os.listdir(data_folder_path)
files_path = [os.path.join(data_folder_path, file_name) for file_name in files_names]

# load_and_first_digest_data is fairly straightforward after "Initial Data Analysis"

In [66]:
path = files_path[0]  
frame = clu.load_and_first_digest_data(path)
if frame.empty:
    print("Empty DataFrame: check the encoding or whether records exist in the pre-processed data")

  data = pd.read_csv(path, encoding='utf-8', on_bad_lines='skip')


# clean_non_info_col Function

In [67]:
pro.clean_non_info_col(frame)


![Image of Titles](Titles.png)

### 1. Identify Feature Columns
- Identifies all columns that contain the word 'Field' in their headers, assuming these are feature columns.



### 2. Determine Prevalent Feature Value
- Determines the most common value using a custom function called `clu.common_feature`, which analyzes the frequency of values in the column according to a threshold.

<br>

- The threshold parameter indicates that a value must appear in at least 90% of the rows to be considered the prevalent value.
     - **General Guideline:** Additional research suggests that when more than 5% of the data is missing, the researcher should explain in detail why it is missing and how imputations were made. When more than 15% to 20% of the data is missing, the reliability of any imputation becomes much more questionable. For the general processing that follows, the threshold is set to the 'Guideline' lower-bound mean.
     - The threshold parameter can be set in `clu.common_feature`, according to the specific data characteristics.
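The prevalence check can be sketched as follows. This is not the code of `clu.common_feature`; it is a minimal stand-in under the assumption that "prevalent" means the most frequent value covers at least `threshold` of the rows. The name `prevalent_value` is hypothetical.

```python
from collections import Counter


def prevalent_value(values, threshold=0.9):
    """Return the most common value if it covers at least `threshold`
    of the entries, otherwise None."""
    values = list(values)
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) >= threshold else None


# 19 of 20 entries agree (95%), so the value counts as prevalent.
mostly_heart_rate = prevalent_value(["heart_rate"] * 19 + ["unknown"])
# Only 8 of 10 entries agree (80%), which is below the 90% threshold.
no_prevalent = prevalent_value(["a"] * 8 + ["b"] * 2)
```

Because the function accepts any iterable, it can be applied directly to a DataFrame column.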


### 3. Conditionally Drop Columns:
- Data records come in (Field, Value, Units) blocks, so dropping a feature column also drops the 'Value' and 'Units' columns associated with it.
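The mechanics of dropping a whole block can be sketched as below. The sketch assumes the columns of one block share a suffix, as in 'Field 2', 'Value 2', 'Units 2'; that naming and the helper `drop_block` are assumptions for illustration, and the decision about which block to drop is left to the caller.

```python
import pandas as pd


def drop_block(frame, field_col):
    """Drop a feature column together with its 'Value' and 'Units' columns."""
    suffix = field_col.replace("Field", "", 1)  # e.g. " 2"
    block = [field_col, "Value" + suffix, "Units" + suffix]
    # Only drop the columns that actually exist in the frame.
    return frame.drop(columns=[c for c in block if c in frame.columns])


# Hypothetical demo frame with two (Field, Value, Units) blocks.
demo = pd.DataFrame({
    "Field 1": ["timestamp", "timestamp"],
    "Value 1": [1, 2],
    "Units 1": ["s", "s"],
    "Field 2": ["unknown", None],
    "Value 2": [7, None],
    "Units 2": [None, None],
})
field_cols = [c for c in demo.columns if "Field" in c]
cleaned = drop_block(demo, "Field 2")
```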


# arrange_features_columns Function

In [68]:
transformed_frame = pro.arrange_features_columns(frame)



### 1. Generate Feature List:
- Uses `clu.create_features_list(frame)` to identify the unique features in the original DataFrame; these become the column headers of the new one.

<br>

### 2. Extract and Organize Data:
- Constructs a new DataFrame from a dictionary, handling missing values in each row by inserting a 'None' value.

<br>

### 3. Create and Return New DataFrame:
- Because the reconstruction seeds the new frame with 'None' values, the result is cleaned by threshold, using `thresh=frame.shape[0]*0.9`.
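The three steps can be sketched together as follows. This is not the code of `pro.arrange_features_columns`; it is a rough illustration that assumes blocks are named 'Field N', 'Value N', 'Units N' and that new headers take the form `feature [units]`. The function name `arrange_features` and the demo values ('cadence', 'rpm') are hypothetical.

```python
import pandas as pd


def arrange_features(frame, keep_ratio=0.9):
    """Turn (Field, Value, Units) blocks into one column per feature."""
    field_cols = [c for c in frame.columns if "Field" in c]
    rows = []
    for _, row in frame.iterrows():
        record = {}
        for field_col in field_cols:
            suffix = field_col.replace("Field", "", 1)
            name = row[field_col]
            if pd.isna(name):
                continue  # empty block in this row
            units = row.get("Units" + suffix)
            header = f"{name} [{units}]" if pd.notna(units) else str(name)
            record[header] = row.get("Value" + suffix)
        rows.append(record)
    # Keys missing from a row's dict become NaN in the new frame.
    wide = pd.DataFrame(rows)
    # Keep only columns that are filled in at least keep_ratio of the rows.
    return wide.dropna(axis=1, thresh=int(len(frame) * keep_ratio))


n = 20
demo = pd.DataFrame({
    "Field 1": ["timestamp"] * n,
    "Value 1": list(range(n)),
    "Units 1": ["s"] * n,
    "Field 2": ["heart_rate"] * (n - 1) + ["cadence"],
    "Value 2": [100] * n,
    "Units 2": ["bpm"] * (n - 1) + ["rpm"],
})
wide = arrange_features(demo)
```

In the demo, the feature that appears in a single row is removed by the threshold step, while the feature missing from only one row is kept.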


# imputation Function

In [69]:
pro.imputation(transformed_frame)

| Unnamed: 0 | timestamp [s] | position_lat [semicircles] | position_long [semicircles] | distance [m] | enhanced_speed [m/s] | enhanced_altitude [m] | heart_rate [bpm] |
|---|---|---|---|---|---|---|---|
| 0 | 1054048548 | 383162669 | 415204331.0 | 2.36 | 1.857 | 26.8 | 108.0 |
| 1 | 1054048549 | 383162885 | 415204129.0 | 4.92 | 2.202 | 26.8 | 108.0 |
| 2 | 1054048555 | 383164149 | 415202638.0 | 21.50 | 2.118 | 26.6 | 110.0 |
| 3 | 1054048561 | 383165243 | 415200807.0 | 39.22 | 2.697 | 23.8 | 110.0 |
| 4 | 1054048566 | 383165594 | 415198952.0 | 54.67 | 2.902 | 21.0 | 110.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 850 | 1054051690 | 382477431.0 | 414846195.0 | 12450.86 | 1.390 | 22.8 | 99.0 |
| 851 | 1054051692 | 382477453.0 | 414846172.0 | 12451.68 | 0.998 | 23.0 | 99.0 |
| 852 | 1054051698 | 382477284.0 | 414846176.0 | 12454.15 | 1.185 | 24.0 | 102.0 |
| 853 | 1054051700 | 382477284.0 | 414846176.0 | 12454.15 | 0.000 | 24.4 | 103.0 |



---
## KNN Imputation for Selected Columns

---

## Custom Imputation for Position and Altitude Columns

---

## Imputation for Distance Data
- For the 'distance' column, the function uses a calculation-based approach, utilizing time and speed data.
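A calculation of this kind can be sketched as below. The exact formula used inside `pro.imputation` is not shown in this notebook, so the sketch assumes the simplest form: a missing distance equals the previous distance plus speed multiplied by the elapsed time. The function name `impute_distance` is hypothetical; the column names match the output above.

```python
import pandas as pd


def impute_distance(df, t="timestamp [s]", v="enhanced_speed [m/s]", d="distance [m]"):
    """Fill missing distances as previous distance + speed * elapsed time."""
    out = df.copy()
    times = out[t].tolist()
    speeds = out[v].tolist()
    dist = out[d].tolist()
    for i in range(1, len(dist)):
        can_fill = not pd.isna(dist[i - 1]) and not pd.isna(speeds[i])
        if pd.isna(dist[i]) and can_fill:
            dist[i] = dist[i - 1] + speeds[i] * (times[i] - times[i - 1])
    out[d] = dist
    return out


# Hypothetical demo frame with two missing distance values.
demo = pd.DataFrame({
    "timestamp [s]": [0, 1, 3, 6],
    "enhanced_speed [m/s]": [2.0, 2.0, 3.0, 1.0],
    "distance [m]": [0.0, None, None, 10.0],
})
filled = impute_distance(demo)
```

The fill runs forward in time, so a value imputed for one row can serve as the starting point for the next missing row.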


In [70]:
file_index = 1
if processed_folder_path:
    clu.save_to_folder(transformed_frame, f"clean_frame, {file_index}", processed_folder_path)
    file_index += 1