# Part 1. Thermostat prototype
 - Create the dataset and build the first prototype of a smart thermostat that can estimate the temperature at home, the departure and arrival time of each family member, their geolocation, and their distance from home.
 
### Content
[1. Data collection+Synthetic data](#part1)

[2. Feature engineering](#part2)
 


## 1. Data collection+Synthetic data
<a class='anchor' id = 'part1'></a>
 - First, create a simple dataset for 2022, assuming that our family has 4 members and that everyone leaves home and comes back once every day. The variables to collect are:
   - members' names
   - timestamps for departure and arrival time
   - temperature at home at exactly the same timestamps (departure and arrival)

In [24]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.decomposition import PCA

"""
Create a sample of data for the project 
    
"""


df = pd.DataFrame({
    "member_name": ["Elena", "Egor", "Kirill", "Elena", "Egor", "Kirill", "Nika", "Egor", "Kirill", "Elena"],
    "departure_time": ["2022-05-03 07:01", "2022-05-03 06:30", "2022-05-03 10:05", "2022-05-04 07:10",
                       "2022-05-04 08:30", "2022-05-04 06:50", "2022-05-04 06:50", "2022-05-05 08:10",
                       "2022-05-05 09:55", "2022-05-05 07:05"],
    "arrival_time": ["2022-05-03 18:04", "2022-05-03 13:30", "2022-05-03 19:05", "2022-05-04 18:20",
                       "2022-05-04 14:30", "2022-05-04 12:50", "2022-05-04 21:40", "2022-05-05 14:20",
                       "2022-05-05 20:55", "2022-05-05 18:05"],
    "temperature_depart": ["20.5", "21.0", "18.7", "21.3", "21.0", "19.5", "19.5", "18.5", "20.0", "18.0"],
    "temperature_arr": ["20.0", "17.0", "21.0", "21.3", "20.0", "18.5", "21.5", "17.5", "21.5", "20.0"]
})

members = ["Elena","Kirill","Egor","Nika"]

"""
Generate synthetic data for the project. The family has 4 members. The time period is 2022.
Take into account that all our members leave their home and come back once per day
(Morning: departure, Afternoon/Evening: arrival)
    
"""

start_date = datetime(2022, 1, 1)
end_date = datetime(2022, 12, 31)
days = (end_date - start_date).days
all_dates = [start_date + timedelta(days=x) for x in range(days)]

# Generate synthetic data for each member
synthetic_data = pd.DataFrame(columns=["member_name", "dep_time", "arr_time","temp_depart", "temp_arr"])

for member in members:
    # Rows of the sample that belong to the current member (not used below)
    member_data = df[df["member_name"] == member]
    # Convert the departure and arrival time strings of the whole sample to datetime objects,
    # so every member draws from the same pool of times
    departure_times = pd.to_datetime(df["departure_time"].unique())
    arrival_times = pd.to_datetime(df["arrival_time"].unique())
    
    
"""
Generate synthetic data for dates 
    
"""


# Generate synthetic departure and arrival timestamps, one pair per date
synthetic_data_departure = []
synthetic_data_arrival = []
for i in range(len(all_dates)):
    # Generate a random index for the departure and arrival times arrays
    random_departure_index = np.random.choice(len(departure_times))
    random_arrival_index = np.random.choice(len(arrival_times))
    # Create new departure and arrival times based on the randomly sampled times
    new_departure_time = pd.Timestamp.combine(all_dates[i], departure_times[random_departure_index].time())
    new_arrival_time = pd.Timestamp.combine(all_dates[i], arrival_times[random_arrival_index].time())
    # Append the sample to the synthetic data
    synthetic_data_departure.append(new_departure_time)
    synthetic_data_arrival.append(new_arrival_time)

        
"""
Generate synthetic data for numeric columns using PCA-based augmentation: use PCA to create new
data points by rotating and scaling the existing data.
This can help to generate new examples that are similar to the original data but not identical

"""
        
# Select the columns to apply PCA on
cols = ["temperature_depart", "temperature_arr"]

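# Fit PCA to the selected columns (the sample stores them as strings, so cast to float first)
pca = PCA(n_components=len(cols))
pca.fit(df[cols].astype(float))

# Generate synthetic data using PCA, one sample per synthetic date
n_samples = len(all_dates)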
synthetic_data_temp= []
for i in range(n_samples):
    # Generate random coefficients for the principal components
    coeffs = np.random.normal(size=pca.n_components_)
    # Transform the coefficients to original space
    sample = pca.inverse_transform(coeffs)
    # Append the sample to the synthetic data
    synthetic_data_temp.append(sample)    
    
# Create a new dataframe with the synthetic data for temperature
synthetic_temp_df = pd.DataFrame(synthetic_data_temp, columns=cols)
    
        
# Create a new dataframe with the synthetic data for the current member
member_synthetic_data = pd.DataFrame({"member_name": [member] * len(all_dates),
                                          "dep_time": synthetic_data_departure,
                                          "arr_time": synthetic_data_arrival})

# To create data for each member, re-run this cell with a different last entry in `members`.
# The code above generates data only for the last member in members = ["Elena","Kirill","Egor","Nika"]

synthetic_data_Nika = pd.concat([member_synthetic_data,synthetic_temp_df],axis=1)
synthetic_data_Nika

#synthetic_data_Egor = pd.concat([member_synthetic_data,synthetic_temp_df],axis=1)
#synthetic_data_Egor

#synthetic_data_Kirill = pd.concat([member_synthetic_data,synthetic_temp_df],axis=1)
#synthetic_data_Kirill

#synthetic_data_Elena = pd.concat([member_synthetic_data,synthetic_temp_df],axis=1)
#synthetic_data_Elena



Unnamed: 0,member_name,dep_time,arr_time,temperature_depart,temperature_arr
0,Nika,2022-01-01 07:10:00,2022-01-01 12:50:00,19.766153,21.400703
1,Nika,2022-01-02 08:30:00,2022-01-02 14:20:00,20.224813,19.770466
2,Nika,2022-01-03 07:05:00,2022-01-03 21:40:00,20.042489,20.410246
3,Nika,2022-01-04 07:10:00,2022-01-04 19:05:00,18.568053,19.860628
4,Nika,2022-01-05 09:55:00,2022-01-05 12:50:00,20.628060,19.172584
...,...,...,...,...,...
359,Nika,2022-12-26 06:30:00,2022-12-26 21:40:00,21.105803,20.264727
360,Nika,2022-12-27 08:30:00,2022-12-27 14:30:00,19.708161,18.576844
361,Nika,2022-12-28 06:30:00,2022-12-28 18:04:00,18.748322,18.659032
362,Nika,2022-12-29 06:30:00,2022-12-29 18:20:00,20.873758,19.276777
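
The cell above draws the PCA coefficients from a standard normal distribution, so each synthetic column ends up with a standard deviation close to 1 regardless of the spread in the sample. As a hedged variant, the sketch below scales each coefficient by the standard deviation of its principal component, so that the synthetic points roughly follow the mean and covariance of the observed pairs. It uses plain NumPy rather than scikit-learn; the helper name `sample_like` and the seed are illustrative assumptions, and the ten temperature pairs are copied from the `df` sample.

```python
import numpy as np

# Temperature pairs (departure, arrival) copied from the sample DataFrame above
observed = np.array([
    [20.5, 20.0], [21.0, 17.0], [18.7, 21.0], [21.3, 21.3], [21.0, 20.0],
    [19.5, 18.5], [19.5, 21.5], [18.5, 17.5], [20.0, 21.5], [18.0, 20.0],
])

def sample_like(data, n_samples, seed=0):
    """Draw points whose mean and covariance approximate those of `data`."""
    rng = np.random.default_rng(seed)
    mean = data.mean(axis=0)
    # Principal axes are the eigenvectors of the sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
    # Scale standard-normal coefficients by each component's standard deviation
    scale = np.sqrt(np.clip(eigvals, 0, None))
    coeffs = rng.normal(size=(n_samples, data.shape[1])) * scale
    # Rotate back to the original (departure, arrival) axes and re-centre
    return coeffs @ eigvecs.T + mean

synthetic = sample_like(observed, n_samples=364)
```

Whether this is preferable depends on the goal: unit-variance sampling is simpler, while scaled sampling keeps any correlation between departure and arrival temperatures that the sample contains.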


In [157]:
#save our dataframe for each member into csv
synthetic_data_Nika.to_csv('Nika.csv', index=False)
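
The generation cell produces rows for one member per run, so it has to be executed and saved four times. A minimal sketch of doing this in one pass, assuming the same idea of pairing each date with a randomly chosen time of day from the sample; `build_member_rows`, the fixed seed, and the inclusive date range (365 days rather than 364) are assumptions made for illustration, and only the standard library is used.

```python
import random
from datetime import date, datetime, time, timedelta

members = ["Elena", "Kirill", "Egor", "Nika"]
# Unique times of day taken from the sample DataFrame above
departure_pool = [time(7, 1), time(6, 30), time(10, 5), time(7, 10), time(8, 30),
                  time(6, 50), time(8, 10), time(9, 55), time(7, 5)]
arrival_pool = [time(18, 4), time(13, 30), time(19, 5), time(18, 20), time(14, 30),
                time(12, 50), time(21, 40), time(14, 20), time(20, 55), time(18, 5)]

def build_member_rows(member, days, rng):
    """Pair every date with a random departure and arrival time from the pools."""
    return [
        {
            "member_name": member,
            "dep_time": datetime.combine(day, rng.choice(departure_pool)),
            "arr_time": datetime.combine(day, rng.choice(arrival_pool)),
        }
        for day in days
    ]

start, end = date(2022, 1, 1), date(2022, 12, 31)
# Inclusive range, so 31 December is part of the data
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

rng = random.Random(42)  # fixed seed so the result is reproducible
all_rows = [row for m in members for row in build_member_rows(m, days, rng)]
```

Passing `all_rows` to `pd.DataFrame` would give a single frame for all members, which could replace the four per-member CSV files.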

In [30]:
#read all files and combine them into one file
Nika=pd.read_csv('Nika.csv')
Nika.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   member_name         364 non-null    object 
 1   dep_time            364 non-null    object 
 2   arr_time            364 non-null    object 
 3   temperature_depart  364 non-null    float64
 4   temperature_arr     364 non-null    float64
dtypes: float64(2), object(3)
memory usage: 14.3+ KB


In [31]:
#read all files and combine them into one file
Egor=pd.read_csv('Egor.csv')
Egor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   member_name         364 non-null    object 
 1   dep_time            364 non-null    object 
 2   arr_time            364 non-null    object 
 3   temperature_depart  364 non-null    float64
 4   temperature_arr     364 non-null    float64
dtypes: float64(2), object(3)
memory usage: 14.3+ KB


In [32]:
#read all files and combine them into one file
Kirill=pd.read_csv('Kirill.csv')
Kirill.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   member_name         364 non-null    object 
 1   dep_time            364 non-null    object 
 2   arr_time            364 non-null    object 
 3   temperature_depart  364 non-null    float64
 4   temperature_arr     364 non-null    float64
dtypes: float64(2), object(3)
memory usage: 14.3+ KB


In [33]:
#read all files and combine them into one file
Elena=pd.read_csv('Elena.csv')
Elena.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   member_name         364 non-null    object 
 1   dep_time            364 non-null    object 
 2   arr_time            364 non-null    object 
 3   temperature_depart  364 non-null    float64
 4   temperature_arr     364 non-null    float64
dtypes: float64(2), object(3)
memory usage: 14.3+ KB


In [34]:
data_full=pd.concat([Nika, Egor,Kirill, Elena], ignore_index=True)
data_full.sample(10)

Unnamed: 0,member_name,dep_time,arr_time,temperature_depart,temperature_arr
826,Kirill,2022-04-09 08:30:00,2022-04-09 14:20:00,17.51419,19.837978
978,Kirill,2022-09-08 06:30:00,2022-09-08 19:05:00,19.716447,20.886903
1415,Elena,2022-11-20 10:05:00,2022-11-20 21:40:00,20.243684,18.670529
610,Egor,2022-09-04 08:10:00,2022-09-04 18:05:00,20.445395,21.326805
644,Egor,2022-10-08 07:10:00,2022-10-08 14:20:00,20.742712,19.145552
316,Nika,2022-11-13 06:50:00,2022-11-13 18:20:00,19.888478,18.998053
1315,Elena,2022-08-12 08:30:00,2022-08-12 19:05:00,21.066937,19.671417
803,Kirill,2022-03-17 06:30:00,2022-03-17 19:05:00,17.206657,21.336919
1421,Elena,2022-11-26 06:50:00,2022-11-26 18:04:00,19.12226,19.439294
171,Nika,2022-06-21 10:05:00,2022-06-21 20:55:00,19.191896,20.757814


In [35]:
#Sanity check. If False - no null data
data_full.isna().sum().any()

False

In [36]:
data_full.to_csv('Data_full.csv', index=False)

In [37]:
#describe our continuous variables
data_full.describe()

Unnamed: 0,temperature_depart,temperature_arr
count,1456.0,1456.0
mean,19.838918,19.764893
std,0.990703,0.996728
min,16.530267,16.398486
25%,19.179352,19.10291
50%,19.829429,19.778188
75%,20.484862,20.442598
max,22.613164,23.481918


In [38]:
# Convert the departure and arrival time columns to datetime objects
data_full['dep_time'] = pd.to_datetime(data_full['dep_time'])
data_full['arr_time'] = pd.to_datetime(data_full['arr_time'])

## 2. Feature engineering
<a class='anchor' id = 'part2'></a>
 - Add more data to our dataset:
     - geolocations
     - type of transport
     - time to reach home
 - Build a rule-based model to switch our thermostat on/off and create a binary (on/off) target variable 'temperature_distance_control'

In [39]:
# Create a new column to indicate whether each family member is at home or not with lag 10min
#True means this member is home

now = pd.Timestamp.now().time()
ten_minutes_plus = (pd.Timestamp.combine(pd.Timestamp.now().date(), now) + pd.Timedelta(minutes=10)).time()

data_full['member_at_home'] = ((data_full['dep_time'].dt.time >= ten_minutes_plus) | 
                       (data_full['arr_time'].dt.time <= ten_minutes_plus))
data_full.sample(7)

Unnamed: 0,member_name,dep_time,arr_time,temperature_depart,temperature_arr,member_at_home
1176,Elena,2022-03-26 07:05:00,2022-03-26 14:30:00,20.638353,18.900856,True
932,Kirill,2022-07-24 07:01:00,2022-07-24 12:50:00,18.622723,20.571444,True
1233,Elena,2022-05-22 07:10:00,2022-05-22 13:30:00,20.033186,19.687415,True
174,Nika,2022-06-24 08:30:00,2022-06-24 18:20:00,20.518813,20.227328,True
638,Egor,2022-10-02 10:05:00,2022-10-02 18:20:00,19.189755,19.652581,True
1240,Elena,2022-05-29 08:10:00,2022-05-29 14:20:00,19.44077,20.297779,True
805,Kirill,2022-03-19 08:30:00,2022-03-19 18:04:00,19.817164,21.10839,True


In [None]:
#for modeling let's take time with better distribution between people at home or out 

#time_str = '14:00:45'
#time_test = (pd.to_datetime(time_str, format='%H:%M:%S')+pd.Timedelta(minutes=10)).time()

#data_full['member_at_home'] = ((data_full['dep_time'].dt.time >= time_test)
#                        | (data_full['arr_time'].dt.time <= time_test))
#data_full.sample(4)

In [40]:
#sanity check
data_full['member_at_home'].value_counts()

True     1158
False     298
Name: member_at_home, dtype: int64
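
These counts depend on the clock time at which the cell was run, because the check uses `pd.Timestamp.now()`. A minimal sketch of the same rule with an explicit reference time, which makes the result reproducible; the function name `is_at_home` and the example times are hypothetical.

```python
from datetime import datetime, time, timedelta

def is_at_home(dep_time, arr_time, reference, lag_minutes=10):
    """True if the member has not left yet, or is already back, at `reference` plus the lag."""
    # Shift the reference time of day by the lag; the date is only needed for the arithmetic
    shifted = datetime.combine(dep_time.date(), reference) + timedelta(minutes=lag_minutes)
    check = shifted.time()
    return dep_time.time() >= check or arr_time.time() <= check

dep = datetime(2022, 3, 26, 7, 5)
arr = datetime(2022, 3, 26, 14, 30)

before_leaving = is_at_home(dep, arr, time(6, 0))   # 06:10 is before the 07:05 departure
while_away = is_at_home(dep, arr, time(12, 0))      # 12:10 falls between departure and arrival
after_return = is_at_home(dep, arr, time(14, 20))   # 14:30 coincides with the arrival
```

Like the original cell, this compares times of day only, so a reference close to midnight wraps around to the next day.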

### 2.1 Add geolocation in the model, type of transport and time to reach home

In [41]:
import math
from geopy import distance
from geopy.point import Point
import random

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time


# Load the dataset into a Pandas DataFrame
data_full = pd.read_csv('/Users/Elena/Desktop/Smart_thermostat/Data for thermostat project/Data_full.csv')
# Convert the departure and arrival time columns to datetime objects
data_full['dep_time'] = pd.to_datetime(data_full['dep_time'])
data_full['arr_time'] = pd.to_datetime(data_full['arr_time'])

data_full.info()

# Create a new column to indicate whether each family member is at home or not with lag 10min
#True means this member is home

now = pd.Timestamp.now().time()
ten_minutes_plus = (pd.Timestamp.combine(pd.Timestamp.now().date(), now) + pd.Timedelta(minutes=10)).time()

data_full['member_at_home'] = ((data_full['dep_time'].dt.time >= ten_minutes_plus) | 
                       (data_full['arr_time'].dt.time <= ten_minutes_plus))
data_full.sample(7)

"""
#1
Create additional features to switch a thermostat on/off based on information about members'
departure or arrival time with a 10 min lag and the current temperature at home compared to the comfort temperature

"""
# Convert 'arr_time' and 'dep_time' to date only and create 2 additional columns
data_full['arr_date'] = data_full['arr_time'].dt.date
data_full['dep_date'] = data_full['dep_time'].dt.date

# Calculate the number of family members at home for each day
data_full['num_at_home'] = data_full.groupby(['dep_date','arr_date'])['member_at_home'].transform('sum')


# Set the comfortable temperature based on the number of family members at home
data_full['comfortable_temp'] = np.where(data_full['num_at_home'] > 0, 20, 15)

# Calculate the thermostat status based on the first arrival and last departure time for each day
data_full['last_departure_time'] = data_full.groupby('dep_date')['dep_time'].transform('max')
data_full['first_arrive_time'] = data_full.groupby('arr_date')['arr_time'].transform('min')

data_full['thermostat_status'] = np.where(((data_full['dep_time'] <= data_full['last_departure_time'])
                                           & (data_full['num_at_home']>0)) |
                                          ((data_full['arr_time'] >= data_full['first_arrive_time'])
                                           & (data_full['num_at_home']>0)), 'on', 'off')

# Apply the temperature and thermostat status calculations to each row in the DataFrame
def control_thermostat(row):
    if row['num_at_home']>0:
        if row['temperature_depart'] < row['comfortable_temp']:
            return row['thermostat_status']
        elif row['temperature_arr'] < row['comfortable_temp']:
            return row['thermostat_status']
        else:
            return 'off'
    else:
        return 'off'

data_full['thermostat_temp_control'] = data_full.apply(control_thermostat, axis=1)


"""
#2
Add geolocation and calculate the distance between two coordinates using the Haversine formula.
Namely, calculate the great circle distance between two points 
on the earth (specified in decimal degrees)

Compare distance from home with a threshold distance=4km. 

"""


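# Set the coordinates of the home in Burnaby (Edmonds station).
# Longitudes are written without the minus sign of western longitudes; the distances
# below are unaffected because every point uses the same convention.
home_coords = (49.2123, 122.9592)

# Set the distance threshold to 4000 meters
dist_threshold = 4000  # in meters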

# Initialize the family member locations
family_member_locations = {
    'Elena': (49.2258, 123.0039),#Metrotown Station
    'Kirill': (49.2827, 122.8490),#Coquitlam
    'Egor': (49.2462, 123.1162),#Vancouver
    'Nika': (49.0504, 122.3045)#Abbotsford
}

# Generate geo locations for our members
out_home_locations = list(family_member_locations.values())

# Shuffle the locations to randomize the assignment
random.shuffle(out_home_locations)

# Assign the coordinates to the DataFrame (round-robin over the shuffled list, independent of member_name)
data_full['latitude'] = [out_home_locations[i % len(out_home_locations)][0] for i in range(len(data_full))]
data_full['longitude'] = [out_home_locations[i % len(out_home_locations)][1] for i in range(len(data_full))]

def haversine_distance(lat1, lon1, lat2, lon2):
    # Convert degrees to radians
    lat1 = math.radians(lat1)
    lon1 = math.radians(lon1)
    lat2 = math.radians(lat2)
    lon2 = math.radians(lon2)

    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    r = 6371  # Radius of the Earth in kilometers
    distance = r * c * 1000  # Convert to meters
    return distance


def distance_out_home(row):
    member = row['member_name']
    location = family_member_locations[member]
    if location is not None:
        latitude, longitude = location
        # If the member's location is not None, calculate the distance to the home coordinates
        dist = haversine_distance(home_coords[0], home_coords[1], latitude, longitude)
        #print(f"Distance to {member}: {dist} meters")  # Add this line for debugging
        return dist
    return None

data_full['distance_out'] = round(data_full.apply(distance_out_home, axis=1),1)

# Apply the distance, family members at home and comfort temperature at the arrival moment

def temperature_control_distance(row):
    dist = row['distance_out']
    num_at_home = row['num_at_home']
    temp_arr=row['temperature_arr']
    if dist<=dist_threshold and temp_arr<20:
        return 'on'
    elif num_at_home >0: 
        return 'on'
    else:
        return 'off'
    
data_full['temperature_distance_control'] = data_full.apply(temperature_control_distance, axis=1)


"""
#3
Add transport and time for every member to reach home in sec.

"""
#create a dictionary with transports
transports= {
    'Elena': 'walk',
    'Kirill': 'car',
    'Egor': 'train',
    'Nika': 'bus'
}

# Generate transport for our members
transport_to_home = list(transports.values())

# Shuffle transports to randomize the assignment
random.shuffle(transport_to_home)

# Assign the transport to DataFrame
data_full['transport'] = [transport_to_home[i % len(transport_to_home)]for i in range(len(data_full))]
    

# Create a feature with the time (in seconds) needed to reach home for every kind of transport.
# distance_out is in meters, so the divisors act as average speeds in meters per second.
def time_reach_home(row):
    transport = row['transport']
    dist = row['distance_out']
    if transport == 'car':
        return dist/11
    elif transport == 'walk':
        return dist/1
    elif transport == 'bus':
        return dist/4
    elif transport == 'train':
        return dist/18
    else:
        return 0
    
data_full['time_to_home'] = round(data_full.apply(time_reach_home, axis=1),1)
data_full.sample(7)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1456 entries, 0 to 1455
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   member_name         1456 non-null   object        
 1   dep_time            1456 non-null   datetime64[ns]
 2   arr_time            1456 non-null   datetime64[ns]
 3   temperature_depart  1456 non-null   float64       
 4   temperature_arr     1456 non-null   float64       
dtypes: datetime64[ns](2), float64(2), object(1)
memory usage: 57.0+ KB


Unnamed: 0,member_name,dep_time,arr_time,temperature_depart,temperature_arr,member_at_home,arr_date,dep_date,num_at_home,comfortable_temp,last_departure_time,first_arrive_time,thermostat_status,thermostat_temp_control,latitude,longitude,distance_out,temperature_distance_control,transport,time_to_home
429,Egor,2022-03-07 06:50:00,2022-03-07 21:40:00,18.679999,19.08909,False,2022-03-07,2022-03-07,3,20,2022-03-07 10:05:00,2022-03-07 12:50:00,on,on,49.2462,123.1162,12007.4,on,train,667.1
534,Egor,2022-06-20 09:55:00,2022-06-20 20:55:00,20.864902,20.97757,False,2022-06-20,2022-06-20,0,15,2022-06-20 10:05:00,2022-06-20 20:55:00,off,off,49.2258,123.0039,12007.4,off,bus,3001.8
1271,Elena,2022-06-29 07:01:00,2022-06-29 18:05:00,19.963647,19.393118,True,2022-06-29,2022-06-29,2,20,2022-06-29 09:55:00,2022-06-29 14:30:00,on,on,49.2827,122.849,3576.8,on,walk,3576.8
142,Nika,2022-05-23 07:10:00,2022-05-23 19:05:00,20.083366,19.711151,True,2022-05-23,2022-05-23,3,20,2022-05-23 07:10:00,2022-05-23 13:30:00,on,on,49.2258,123.0039,50922.7,on,bus,12730.7
1254,Elena,2022-06-12 07:10:00,2022-06-12 21:40:00,18.817921,20.306145,False,2022-06-12,2022-06-12,3,20,2022-06-12 08:10:00,2022-06-12 12:50:00,on,on,49.2258,123.0039,3576.8,on,bus,894.2
1194,Elena,2022-04-13 08:30:00,2022-04-13 20:55:00,20.439253,20.198163,False,2022-04-13,2022-04-13,3,20,2022-04-13 08:30:00,2022-04-13 12:50:00,on,off,49.2258,123.0039,3576.8,on,bus,894.2
449,Egor,2022-03-27 06:50:00,2022-03-27 14:20:00,19.640377,19.420425,True,2022-03-27,2022-03-27,3,20,2022-03-27 08:30:00,2022-03-27 14:20:00,on,on,49.2462,123.1162,12007.4,on,train,667.1
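
In the sample above, `latitude` and `longitude` are assigned from a shuffled list while `distance_out` is looked up by `member_name`, so the two can disagree (row 534 shows Egor with the Metrotown coordinates but the Vancouver distance). A minimal sketch of computing the distance directly from a pair of coordinates, using the usual signed convention in which western longitudes are negative; the function name `haversine_m` is an assumption, and the coordinates are the ones from the cell above with the sign added.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Western longitudes are negative in the signed convention
home = (49.2123, -122.9592)        # Edmonds station, Burnaby
metrotown = (49.2258, -123.0039)   # Elena's out-of-home location

dist = haversine_m(*home, *metrotown)
```

Applied row by row to the `latitude` and `longitude` columns, a function like this would keep the distance consistent with the coordinates stored in the same row.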


In [42]:
#sanity check
#data_full['num_at_home'].value_counts()
#data_full[data_full['num_at_home']==0].sample(3)
#data_full['latitude']
#data_full.groupby(['member_name'])['member_name','transport','latitude','longitude'].sample(4)
#data_full.info()
data_full['temperature_distance_control'].value_counts()
#data_full[['transport','member_name','latitude','longitude']]

on     1449
off       7
Name: temperature_distance_control, dtype: int64
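
With these rules the thermostat is switched on in 1449 of 1456 rows, and `time_to_home` does not yet enter the decision. One possible way to bring travel time into the rule is sketched below, under stated assumptions: the speeds are copied from `time_reach_home` and read as meters per second, while `PREHEAT_LEAD_S`, `time_to_home_s`, and `should_heat` are hypothetical names and values chosen for illustration. Unlike the original function, an unknown transport type yields `None` rather than 0.

```python
# Assumed average speeds in meters per second, copied from time_reach_home
SPEED_M_PER_S = {"walk": 1, "bus": 4, "car": 11, "train": 18}

COMFORT_TEMP_C = 20    # comfortable temperature used in the cells above
PREHEAT_LEAD_S = 900   # hypothetical: start heating 15 minutes before arrival

def time_to_home_s(distance_m, transport):
    """Travel time in seconds, or None for an unknown transport type."""
    speed = SPEED_M_PER_S.get(transport)
    return None if speed is None else distance_m / speed

def should_heat(distance_m, transport, indoor_temp_c, num_at_home):
    """Heat if the home is cold and someone is in, or will arrive within the lead time."""
    if indoor_temp_c >= COMFORT_TEMP_C:
        return False
    if num_at_home > 0:
        return True
    eta = time_to_home_s(distance_m, transport)
    return eta is not None and eta <= PREHEAT_LEAD_S

train_eta = time_to_home_s(12007.4, "train")   # Egor's distance from the table above
heat_for_train = should_heat(12007.4, "train", 19.1, num_at_home=0)
heat_for_walk = should_heat(3576.8, "walk", 19.1, num_at_home=0)
```

A rule of this shape should produce a less one-sided target than the counts above, though the lead time would need tuning against real heating behaviour.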

In [None]:
#save data for modeling into the file 'Data for model'
#data_full.to_csv('Data_for_model.csv', index=False)

**Conclusion**
 - We created rules to switch the thermostat on/off based on departure and arrival times and the temperature at those times, as well as on the time needed to reach home, which takes into account geolocation, distance from home, and the travel speed of each type of transport.
 
**Next steps**
 - EDA process and data visualization
 - Run models, starting with logistic regression, then Random Forest and XGBoost