In [1]:
import sys

# Data Science Libraries
import pandas as pd
import numpy as np
import datetime
 
# Adding Scrapwork to the system path to call wrangle
sys.path.insert(0, '/Users/student/Documents/Jeanette\'s Workspace/Mars_Weather_Prediction/Scrapwork')

# Import my own functions
import wrangle

# Block Warning Boxes
import warnings
warnings.filterwarnings("ignore")

<img src="Images/InSight_Mission_Logo.png" alt="InSight Mission Logo" title="InSight Mission Logo" width="332" height="385" align="center">
<h1 align = "center">Findings Thus Far...</h1>
<h2 align = "center">By Jeanette Schulz</h2>
<h4 align = "center">2022 JAN 24</h4>


<hr style="border:2px solid red"> </hr>

# Executive Summary
This project is unfinished. The dataset proved very difficult to prepare and explore, and as a novice Data Scientist I used my time poorly trying to clean the data enough for modeling.

If I had more time, I would collect more than 120 days' worth of data from InSight's TWINS and investigate ways to fill the time gaps so that I had a continuous dataset. I could then model on that dataset and try to make predictions with it. With more experience in the future, I may be able to work with a time-series dataset that has time gaps, but for now that is beyond my scope.



<hr style="border:2px solid red"> </hr>

# Project Goal
Using some of the sensor data collected by InSight, the goal of this project was to see if I could find patterns that would allow me to predict future weather over <b>Elysium Planitia</b> on Mars. Since Mars has no water cycle like here on Earth, I would be predicting Air Temperature and Wind Speed with direction.


<hr style="border:2px solid red"> </hr>

# Planning
The original plan was to collect the first 120 days of data from when InSight landed on Mars. I did this by scraping a website I found with links to the raw csv data. I had no experience with web scraping, so this was my first hurdle for this project. Once I had downloaded all the csv files, I combined them into a single file that I could then access as a dataframe. This was all done using Python. From here I was to prepare, explore, and finally create a model that could predict the average Air Temperature for the next day. I hoped that by eliminating wind as a factor for now, I could create an MVP.
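
The step of combining the downloaded csv files into one dataframe can be sketched roughly as follows. This is only a minimal illustration, not the code I actually ran: the function name `combine_csv_files` and the folder name in the usage comment are assumptions made for the example.

```python
from pathlib import Path

import pandas as pd


def combine_csv_files(folder, pattern="*.csv"):
    """Stack every csv in `folder` that matches `pattern` into one dataframe."""
    # Sort the paths so the files are stacked in a predictable order
    files = sorted(Path(folder).glob(pattern))

    # Read each file into its own dataframe
    frames = [pd.read_csv(path) for path in files]

    # Stack the dataframes on top of each other with a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Hypothetical usage (the folder name is made up for this example):
# combined = combine_csv_files("raw_twins_csvs")
# combined.to_csv("INSIGHT_TWINS_RAW.csv", index=False)
```

Writing the combined file with `index=False` keeps the old row index from coming back as an extra `Unnamed: 0` column the next time the file is read.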


<hr style="border:2px solid red"> </hr>

# Preparing the Data
#### I now understand why this step takes up most of the time in the Data Science pipeline. This is where I unwisely spent 4 days of my time, trying and hoping to get an MVP on the table.

#### I started with a dataframe that had 64 columns and over two million rows of data. I could see that data was collected by the second from the TWINS on InSight. My first goal was to reduce the 64 columns to fewer than five if possible. It's always easier to model with fewer columns.

In [2]:
df = pd.read_csv('Scrapwork/INSIGHT_TWINS_RAW.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,AOBT,SCLK,LMST,LTST,UTC,BMY_2L_TEMP_1,BMY_2L_TEMP_2,BMY_2L_TEMP_3,BMY_2L_TEMP_4,...,BPY_WD_OUT_6,BPY_WD_OUT_7,BPY_WD_OUT_8,BPY_WD_OUT_9,BPY_WD_OUT_10,BPY_WD_OUT_11,BPY_WD_OUT_12,BPY_WIND_FREQUENCY,BPY_AIR_TEMP_FREQUENCY,BPY_ASIC_TEMP
0,0,596876952.0,596861200.0,00004M06:46:33.826,00004 06:05:41,2018-334T14:46:55.755Z,-4353.0,-4645.0,-4778.0,-5006.0,...,,,,,,,,,,
1,1,596876953.0,596861200.0,00004M06:46:34.799,00004 06:05:42,2018-334T14:46:56.755Z,-4415.0,-4723.0,-4870.0,-5099.0,...,,,,,,,,,,
2,2,596876954.0,596861200.0,00004M06:46:35.772,00004 06:05:43,2018-334T14:46:57.755Z,-4424.0,-4727.0,-4861.0,-5107.0,...,,,,,,,,,,
3,3,596876955.0,596861200.0,00004M06:46:36.746,00004 06:05:44,2018-334T14:46:58.755Z,-4409.0,-4722.0,-4859.0,-5109.0,...,,,,,,,,,,
4,4,596876956.0,596861200.0,00004M06:46:37.719,00004 06:05:45,2018-334T14:46:59.755Z,-4424.0,-4728.0,-4873.0,-5107.0,...,,,,,,,,,,


In [3]:
df.shape

(2350436, 64)

#### To start eliminating columns, I first had to learn what each column of data represented. I had no domain knowledge, so even after I found out what a column's abbreviation stood for, I had to do some more research to understand what it meant. For some columns, I never really found an answer. For example, what is "cold die temperature", and how is it different from air temperature?
  
#### It was here that I also noticed the data was divided by the TWINS arm it was collected on. I quickly realized that InSight basically had two "arms" collecting weather data, but not at the same time. The "right" and "left" sides of TWINS alternated data collection, except in the event of a unique occurrence like a dust devil.

## TWINS Raw Data Columns
|System          |  # | Column  |Data Type              | Description                                         |
|:---------------|:---|:--------|:----------------------|:----------------------------------------------------|
|Time References |  1 | AOBT    |ASCII_Real             | APSS Onboard Time
|                |  2 | SCLK    |ASCII_Real             | Spacecraft Clock
|                |  3 | LMST    |ASCII_String           | Local Mean Solar Time
|                |  4 | LTST    |ASCII_String           | Local True Solar Time
|                |  5 | UTC     |ASCII_Date_Time_DOY_UTC| Coordinated Universal Time
|================|====|=========|=======================|=====================================================|
| BOOM -Y        |  6 | BMY_2L_TEMP_1 | ASCII_Integer | Wind Sensor transducer 1 Printed Circuit Board temperature PT-1000 Platinum Resistance Thermometer |
|                |  7 | BMY_2L_TEMP_2 | ASCII_Integer | WS transducer 2 PCB temperature PT-1000 PRT | 
|                |  8 | BMY_2L_TEMP_3 | ASCII_Integer | WS transducer 3 PCB temperature PT-1000 PRT |
|                |  9 | BMY_2L_TEMP_4 | ASCII_Integer | ATS-mid-rod temperature: PT1000 PRT sensor located at an intermediate position in the ATS rod|
|                | 10 | BMY_2L_TEMP_4_AVERAGE | ASCII_Integer | ATS-mid-rod temperature average of the last N samples
|                | 11 | BMY_2L_TEMP_4_STD     | ASCII_Integer | ATS-mid-rod temperature standard deviation of the last N samples
|                | 12 | BMY_2L_TEMP_5         | ASCII_Integer | Boom Housing Temp: PT-1000 PRT located at the Boom housing near the base of the ATS rod
|                | 13 | BMY_2L_TEMP_6         | ASCII_Integer | Calibration resistor: 1K ohm
|                | 14 | BMY_AIR_TEMP | ASCII_Integer | ATS-rod-extreme temperature: PT1000 PRT located at ATS extreme
|                | 15 | BMY_AIR_TEMP_AVERAGE  | ASCII_Integer | ATS-rod-extreme temperature average of the last N samples
|                | 16 | BMY_AIR_TEMP_STD | ASCII_Integer | ATS-rod-extreme temperature standard deviation of the last N samples
|                | 17 | BMY_WD_REF_OUT_1 | ASCII_Integer | WS transducer 1 cold die temperature
|                | 18 | BMY_WD_REF_OUT_2 | ASCII_Integer | WS transducer 2 cold die temperature
|                | 19 | BMY_WD_REF_OUT_3 | ASCII_Integer | WS transducer 3 cold die temperature
|                | 20 | BMY_WD_OUT_1     | ASCII_Integer | Number of counts measured for WS channel 1
|                | 21 | BMY_WD_OUT_2     | ASCII_Integer | Number of counts measured for WS channel 2
|                | 22 | BMY_WD_OUT_3     | ASCII_Integer | Number of counts measured for WS channel 3
|                | 23 | BMY_WD_OUT_4     | ASCII_Integer | Number of counts measured for WS channel 4
|                | 24 | BMY_WD_OUT_5     | ASCII_Integer | Number of counts measured for WS channel 5
|                | 25 | BMY_WD_OUT_6     | ASCII_Integer | Number of counts measured for WS channel 6
|                | 26 | BMY_WD_OUT_7     | ASCII_Integer | Number of counts measured for WS channel 7
|                | 27 | BMY_WD_OUT_8     | ASCII_Integer | Number of counts measured for WS channel 8
|                | 28 | BMY_WD_OUT_9     | ASCII_Integer | Number of counts measured for WS channel 9
|                | 29 | BMY_WD_OUT_10    | ASCII_Integer | Number of counts measured for WS channel 10
|                | 30 | BMY_WD_OUT_11    | ASCII_Integer | Number of counts measured for WS channel 11
|                | 31 | BMY_WD_OUT_12 | ASCII_Integer | Number of counts measured for WS channel 12
|                | 32 | BMY_ASIC_TEMP | ASCII_Integer | ASIC temperature
|                | 33 | BMY_AIR_TEMP_FREQUENCY | ASCII_String | Air temperature channels frequency or frequencies
|                | 34 | BMY_WIND_FREQUENCY | ASCII_String | Wind channels frequency or frequencies
|================|====|=========|=======================|=====================================================|
| BOOM +Y        | 35 | BPY_2L_TEMP_1 | ASCII_Integer | WS transducer 1 PCB temperature PT-1000 PRT
|                | 36 | BPY_2L_TEMP_2 | ASCII_Integer | WS transducer 2 PCB temperature PT-1000 PRT
|                | 37 | BPY_2L_TEMP_3 | ASCII_Integer | WS transducer 3 PCB temperature PT-1000 PRT
|                | 38 | BPY_2L_TEMP_4 | ASCII_Integer | Calibration resistor: 1K ohm
|                | 39 | BPY_2L_TEMP_5 | ASCII_Integer | ATS-mid-rod temperature: PT1000 PRT sensor located at an intermediate position in the ATS rod
|                | 40 | BPY_2L_TEMP_5_AVERAGE | ASCII_Integer | ATS-mid-rod temperature average of the last N samples
|                | 41 | BPY_2L_TEMP_5_STD | ASCII_Integer | ATS-mid-rod temperature standard deviation of the last N samples
|                | 42 | BPY_2L_TEMP_6 | ASCII_Integer | Boom Housing Temp: PT-1000 PRT located at the Boom housing near the base of the ATS rod
|                | 43 | BPY_AIR_TEMP | ASCII_Integer | ATS-rod-extreme temperature: PT1000 PRT located at ATS extreme
|                | 44 | BPY_AIR_TEMP_AVERAGE | ASCII_Integer | ATS-rod-extreme temperature average of the last N samples
|                | 45 | BPY_AIR_TEMP_STD | ASCII_Integer | ATS-rod-extreme temperature standard deviation of the last N samples
|                | 46 | BPY_WD_REF_OUT_1 | ASCII_Integer | WS transducer 1 cold die temperature
|                | 47 | BPY_WD_REF_OUT_2 | ASCII_Integer | WS transducer 2 cold die temperature
|                | 48 | BPY_WD_REF_OUT_3 | ASCII_Integer | WS transducer 3 cold die temperature
|                | 49 | BPY_WD_OUT_1     | ASCII_Integer | Number of counts measured for WS channel 1
|                | 50 | BPY_WD_OUT_2     | ASCII_Integer | Number of counts measured for WS channel 2
|                | 51 | BPY_WD_OUT_3     | ASCII_Integer | Number of counts measured for WS channel 3
|                | 52 | BPY_WD_OUT_4     | ASCII_Integer | Number of counts measured for WS channel 4
|                | 53 | BPY_WD_OUT_5     | ASCII_Integer | Number of counts measured for WS channel 5
|                | 54 | BPY_WD_OUT_6     | ASCII_Integer | Number of counts measured for WS channel 6
|                | 55 | BPY_WD_OUT_7     | ASCII_Integer | Number of counts measured for WS channel 7
|                | 56 | BPY_WD_OUT_8     | ASCII_Integer | Number of counts measured for WS channel 8
|                | 57 | BPY_WD_OUT_9     | ASCII_Integer | Number of counts measured for WS channel 9
|                | 58 | BPY_WD_OUT_10    | ASCII_Integer | Number of counts measured for WS channel 10
|                | 59 | BPY_WD_OUT_11    | ASCII_Integer | Number of counts measured for WS channel 11
|                | 60 | BPY_WD_OUT_12    | ASCII_Integer | Number of counts measured for WS channel 12
|                | 61 | BPY_ASIC_TEMP    | ASCII_Integer | ASIC temperature
|                | 62 | BPY_AIR_TEMP_FREQUENCY | ASCII_String |Air temperature channels frequency or frequencies
|                | 63 | BPY_WIND_FREQUENCY | ASCII_String | Wind channels frequency or frequencies

#### Taking the time to create a data dictionary helped me discover the redundancies NASA has in place, such as the 12 channels collecting wind speed. I decided the best way to minimize my columns was to take the average of all the repetitive columns.
  
#### At first, I also kept the two TWINS arms separate to make sure each side got proper representation when I later averaged the two together.

In [4]:
# Average of the BMY_2L_TEMP's (TEMP_6 is the calibration resistor, so it is left out)
df['BMY_2L_TEMP_AVG'] = (df.BMY_2L_TEMP_1 + df.BMY_2L_TEMP_2 + df.BMY_2L_TEMP_3 + df.BMY_2L_TEMP_4 + df.BMY_2L_TEMP_5)/5

# Average of the BMY_WD_REF_OUT's
df['BMY_WD_REF_OUT_AVG'] = (df.BMY_WD_REF_OUT_1 + df.BMY_WD_REF_OUT_2 + df.BMY_WD_REF_OUT_3)/3

# Pop all BMY temperature stuff into a dataframe
BMY_simple_df = pd.concat([df.BMY_2L_TEMP_AVG, # temperature
                           df.BMY_AIR_TEMP, # temperature
                           df.BMY_WD_REF_OUT_AVG, # temperature
                           df.BMY_ASIC_TEMP # temperature
                          ], axis=1)

BMY_simple_df.head()

Unnamed: 0,BMY_2L_TEMP_AVG,BMY_AIR_TEMP,BMY_WD_REF_OUT_AVG,BMY_ASIC_TEMP
0,-4658.8,-5703.0,5045.0,8485.0
1,-4741.4,-5776.0,4968.666667,8488.0
2,-4746.4,-5803.0,4977.666667,8491.0
3,-4740.4,-5771.0,4982.333333,8494.0
4,-4748.4,-5754.0,4981.333333,8498.0


In [5]:
# Average of BPY_2L_TEMP's (TEMP_4 is the calibration resistor, so it is left out)
df['BPY_2L_TEMP_AVG'] = (df.BPY_2L_TEMP_1 + df.BPY_2L_TEMP_2 + df.BPY_2L_TEMP_3 + df.BPY_2L_TEMP_5 + df.BPY_2L_TEMP_6)/5

# Average of BPY_WD_REF_OUT's
df['BPY_WD_REF_OUT_AVG'] = (df.BPY_WD_REF_OUT_1 + df.BPY_WD_REF_OUT_2 + df.BPY_WD_REF_OUT_3)/3

# Group all BPY temperature stuff into a dataframe
BPY_simple_df = pd.concat([df.BPY_2L_TEMP_AVG, # temperature
                           df.BPY_AIR_TEMP, # temperature
                           df.BPY_WD_REF_OUT_AVG, # temperature
                           df.BPY_ASIC_TEMP # temperature
                          ], axis=1)
BPY_simple_df[95:100]

Unnamed: 0,BPY_2L_TEMP_AVG,BPY_AIR_TEMP,BPY_WD_REF_OUT_AVG,BPY_ASIC_TEMP
95,-2376.6,-2238.0,10322.0,8886.0
96,-2365.0,-2216.0,10323.333333,8892.0
97,-2357.6,-2202.0,10324.333333,8895.0
98,-2360.6,-2227.0,10319.333333,8900.0
99,-2342.0,-2221.0,10334.666667,8904.0
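
The averaging in the two cells above could also be written as one small helper that builds the column names and lets pandas take the row-wise mean. This is a sketch under assumptions rather than what I ran: `average_columns` is a hypothetical name and the toy values are made up. Passing `skipna=False` keeps the behavior of adding the columns with `+`, where a single missing reading makes the whole average null.

```python
import numpy as np
import pandas as pd


def average_columns(df, prefix, numbers, skipna=False):
    """Row-wise mean of the columns named '<prefix>_<number>'."""
    # Build the names explicitly so columns like '..._AVERAGE' or '..._STD'
    # are never picked up by accident
    columns = [f"{prefix}_{n}" for n in numbers]

    # skipna=False: any null in a row makes that row's average null
    # skipna=True:  the average is taken over whatever values are present
    return df[columns].mean(axis=1, skipna=skipna)


# Toy data (made-up values) standing in for three cold die temperature columns
toy = pd.DataFrame({
    "BMY_WD_REF_OUT_1": [1.0, 4.0],
    "BMY_WD_REF_OUT_2": [2.0, np.nan],
    "BMY_WD_REF_OUT_3": [3.0, 6.0],
})

strict_avg = average_columns(toy, "BMY_WD_REF_OUT", [1, 2, 3])
lenient_avg = average_columns(toy, "BMY_WD_REF_OUT", [1, 2, 3], skipna=True)
```

With the real data, the same helper could take `"BMY_2L_TEMP"` with `[1, 2, 3, 4, 5]` or `"BPY_2L_TEMP"` with `[1, 2, 3, 5, 6]`, which skips the calibration resistor on each boom.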


### Combining BMY and BPY
#### Since the two arms alternated data collection I needed to know if they had any overlap before combining them.

In [6]:
df[['BMY_2L_TEMP_AVG', 'BPY_2L_TEMP_AVG']].isnull().sum(axis=1).value_counts()

2    1251752
1     782775
0     315909
dtype: int64

True + False = 1: Only one column has data (the other is null)  
False + False = 0: Both columns have data (neither is null)  
True + True = 2: Neither column has any recorded data (both are null)

#### The takeaway from this is that I had over 1.25 million rows where neither arm recorded data, about 780 thousand rows where only one arm collected data, and about 316 thousand rows where the two arms collected data at the same time. The function I built accounted for these possibilities and returned a single row that combined the two sides of TWINS.

In [7]:
def combine_TWINS(BMY_array, BPY_array):
    # Create the list to catch our data
    mars_temp = []

    # for x, y in zip(array1, array2):   
    for BMY, BPY in zip(BMY_array, BPY_array):

        # If both columns have a value, take the average of the two
        if (np.isnan(BMY) + np.isnan(BPY)) == 0 :
            mars_temp.append((BMY + BPY)/2)
        
        # If both columns have null value, add the first null value
        elif (np.isnan(BMY) + np.isnan(BPY)) == 2:
            mars_temp.append(BMY)

        # If BMY is only column with data, add to list
        elif not np.isnan(BMY):
            mars_temp.append(BMY)
        
        # If BPY is only column with data, add to list
        else:
            mars_temp.append(BPY)

    return pd.DataFrame(mars_temp)
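
The same three cases can be handled without a Python loop, which matters with over two million rows. The sketch below is an alternative I did not use in the project, and `combine_twins_vectorized` is a hypothetical name. It relies on the fact that pandas skips nulls by default when taking a row-wise mean: two values give their average, one value gives that value, and two nulls give null. The sample readings are made up.

```python
import numpy as np
import pandas as pd


def combine_twins_vectorized(bmy, bpy):
    """Combine the readings of the two booms into a single Series."""
    # Put the two booms side by side, one column each
    both = pd.concat([bmy, bpy], axis=1)

    # mean(axis=1) ignores nulls by default (skipna=True):
    #   both present -> average, one present -> that value, none -> NaN
    return both.mean(axis=1)


# Made-up readings that cover all of the cases
bmy = pd.Series([-5703.0, np.nan, -2238.0, np.nan], name="BMY_AIR_TEMP")
bpy = pd.Series([-5705.0, np.nan, np.nan, -2216.0], name="BPY_AIR_TEMP")

combined = combine_twins_vectorized(bmy, bpy)
# combined holds -5704.0, NaN, -2238.0, -2216.0
```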

In [8]:
# Combine the air temperature readings from the two TWINS arms
twins_air_TEMP = combine_TWINS(df.BMY_AIR_TEMP.to_numpy(), df.BPY_AIR_TEMP.to_numpy())

In [9]:
# Creating the dataframe intended to be used for Exploration and Modeling
final_df = pd.concat([df.LTST, twins_air_TEMP], axis=1, ignore_index=True, sort=False)
final_df.rename(columns={0:'local_true_solar_time', 1:'twins_air_TEMP'}, inplace=True)
final_df.head()

Unnamed: 0,local_true_solar_time,twins_air_TEMP
0,00004 06:05:41,-5703.0
1,00004 06:05:42,-5776.0
2,00004 06:05:43,-5803.0
3,00004 06:05:44,-5771.0
4,00004 06:05:45,-5754.0


#### Now that I had my dataframe down to two columns, all I needed to do was turn my time column into a datetime and replace my index with it. Unfortunately, I realized that none of the time columns collected by InSight were in an Earth date format that I recognized. This was a problem because Python's `datetime` type expects Earth calendar dates, so I couldn't make the time I had into a datetime object without first converting it into Earth time. For this I decided to go with "Local True Solar Time" because it looked the closest to a date and time I recognized.

In [10]:
def mars_to_earth_time(mars_date):
    mars_day = int(mars_date[:5])
    mars_hour = int(mars_date[6:8])
    mars_minute = int(mars_date[9:11])
    mars_second = int(mars_date[12:14])

    # mars_day is how many sols (Mars days) InSight has been on Mars
    # InSight landed on Mars on November 26, 2018 2:52:59 PM EST (Earth Time)
    landing_day = pd.to_datetime('November 26, 2018 14:52:59', format = '%B %d, %Y %H:%M:%S')

    # 1 mars day = 24 hours, 39 minutes and 35 seconds on Earth
    # Add that total time to the landing time once for every mars day
    earth_time = landing_day + mars_day * datetime.timedelta(hours= 24, minutes= 39, seconds= 35)
    
    # Finally, add the remaining hours, minutes, and seconds
    # (the Mars clock time is added as if it were Earth clock time, which is an approximation)
    earth_time += datetime.timedelta(hours= mars_hour, minutes= mars_minute, seconds= mars_second)
    
    return earth_time
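
Calling a function like this once per row is slow on two million rows, so the same arithmetic can be sketched on a whole column at once. This is an illustrative alternative, not the code I ran: `ltst_to_earth_time`, `LANDING_TIME`, and `SOL_SECONDS` are names made up for the example. It keeps the same approximations as the function above, including the sol length rounded to 24 hours, 39 minutes and 35 seconds.

```python
import pandas as pd

# Landing time used above (2:52:59 PM EST), kept timezone-naive
LANDING_TIME = pd.Timestamp("2018-11-26 14:52:59")

# One sol, rounded to 24 h 39 min 35 s, expressed in seconds (88,775)
SOL_SECONDS = 24 * 3600 + 39 * 60 + 35


def ltst_to_earth_time(ltst):
    """Convert a Series of 'SSSSS HH:MM:SS' strings to Earth timestamps."""
    # First five characters: the sol number
    sols = ltst.str[:5].astype(int)

    # Characters 6 to 13: the clock time, parsed as a plain duration
    clock = pd.to_timedelta(ltst.str[6:14])

    # Landing time + whole sols + clock time, computed for every row at once
    return LANDING_TIME + pd.to_timedelta(sols * SOL_SECONDS, unit="s") + clock


sample = pd.Series(["00004 06:05:41", "00004 06:05:42"])
earth = ltst_to_earth_time(sample)
# earth holds 2018-11-30 23:37:00 and 2018-11-30 23:37:01
```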

#### Once I had a datetime to put in my index, I was able to create a final prepare function to use for exploration.

In [11]:
def prepare_TWINS():
    # Read in data and drop unneeded columns
    df = pd.read_csv('MVP2.csv').reset_index(drop=True)
    df = df.drop(columns=['Unnamed: 0', 'local_true_solar_time'])

    # Converting earth_date to datetime dtype
    df.earth_date = pd.to_datetime(df.earth_date, format = '%Y-%m-%d %H:%M:%S')

    # Setting the 'earth_date' column as the Index and sorting that new Index:
    df = df.set_index('earth_date').sort_index()
    
    return df

In [15]:
df = prepare_TWINS()
df.head()

earth_date,twins_2L_TEMP,twins_air_TEMP
2018-11-30 23:37:00,-4658.8,-5703.0
2018-11-30 23:37:01,-4741.4,-5776.0
2018-11-30 23:37:02,-4746.4,-5803.0
2018-11-30 23:37:03,-4740.4,-5771.0
2018-11-30 23:37:04,-4748.4,-5754.0


#### I did some exploration with this dataframe and in doing so, learned that I had even more gaps in time than I expected. As of right now, I am unsure of how to adjust for the gaps in time, as they are present no matter how I resample. 
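
One possible next step, which I have not tried on the real data, is to list where the gaps are before deciding how to handle them, and then fill only the short ones. The sketch below assumes a dataframe with a sorted datetime index like the one returned by `prepare_TWINS()`. The function names, the one-minute and three-hour thresholds, and the toy values are all assumptions for illustration.

```python
import pandas as pd


def find_time_gaps(df, max_step="1min"):
    """List every jump between consecutive timestamps longer than max_step."""
    stamps = pd.Series(df.index)
    step = stamps.diff()                      # time since the previous row
    is_gap = step > pd.Timedelta(max_step)    # the first row (NaT) is never a gap
    return pd.DataFrame({
        "gap_start": stamps.shift(1)[is_gap],
        "gap_end": stamps[is_gap],
        "gap_length": step[is_gap],
    }).reset_index(drop=True)


def hourly_with_short_gaps_filled(series, max_hours=3):
    """Hourly means, linearly interpolating at most max_hours empty hours."""
    hourly = series.resample("60min").mean()  # hours with no data become NaN
    # limit caps how many consecutive empty hours get filled, so a longer gap
    # is only filled for its first max_hours hours and stays null after that
    return hourly.interpolate(method="time", limit=max_hours, limit_area="inside")


# Toy data (made-up values) with one gap of about two and a half hours
toy = pd.DataFrame(
    {"twins_air_TEMP": [10.0, 10.0, 40.0, 40.0]},
    index=pd.to_datetime([
        "2018-11-30 23:37:00", "2018-11-30 23:37:01",
        "2018-12-01 02:10:00", "2018-12-01 02:10:01",
    ]),
)

gaps = find_time_gaps(toy)
filled = hourly_with_short_gaps_filled(toy["twins_air_TEMP"])
```

Whether interpolating is reasonable depends on how long the gaps are compared to the daily temperature cycle, so the gap table is worth looking at first.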