# **Data Science 2 - Homework 2 Solution**
***

In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns
from IPython.display import display,Markdown
import statsmodels.api as sm

## **1. Load the "Individual household electric power consumption" dataset** 
***

First, we will load the dataset and observe it:

In [2]:
# first define datatypes for the columns
dtypes = {'Date': str,
          'Time': str,
          'Global_active_power': float,
          'Global_reactive_power': float,
          'Voltage': float,
          'Global_intensity': float,
          'Sub_metering_1': float,
          'Sub_metering_2': float,
          'Sub_metering_3': float}

# fetch the dataset (the separator is ';' and missing values are marked as '?')

file_path = 'household_power_consumption.txt'
power_consumption_df = pd.read_csv(file_path, sep=';',dtype=dtypes,na_values='?')
power_consumption_df.head()

We can see that we have 9 features:
* Date
* Time
* Global_active_power
* Global_reactive_power
* Voltage
* Global_intensity
* Sub_metering_1
* Sub_metering_2
* Sub_metering_3

However, our target (the active energy consumed every minute, in watt-hours, by equipment not measured by the three sub-meters) is not yet present in the dataset and needs to be calculated. We would also like to combine the date and time into a single feature called "Datetime".

In [None]:
# combine date and time into datetime
power_consumption_df['Datetime'] = pd.to_datetime(power_consumption_df['Date'] + ' ' + power_consumption_df['Time'], format='%d/%m/%Y %H:%M:%S')

# Set the datetime column as index
power_consumption_df.set_index('Datetime', inplace=True)

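# Drop the original Date and Time columns
power_consumption_df.drop(columns=['Date', 'Time'], inplace=True)

# Target: active energy (Wh) consumed per minute by equipment not covered by the three sub-meters.
# Global_active_power is in kW, so * 1000 / 60 converts it to Wh per minute.
power_consumption_df['active_power_per_minute'] = (power_consumption_df['Global_active_power'] * 1000 / 60) - (power_consumption_df['Sub_metering_1'] + power_consumption_df['Sub_metering_2'] + power_consumption_df['Sub_metering_3'])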
power_consumption_df.head()
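
To make the unit conversion concrete, here is a minimal sketch on a single row. The row values and the name `toy` are illustrative assumptions, not read from the file. `Global_active_power` is in kilowatts and the sub-meterings are in watt-hours, so multiplying by 1000 / 60 gives watt-hours per minute.

```python
import pandas as pd

# One illustrative row in the same layout as the raw file (values are made up for the example)
toy = pd.DataFrame({
    'Date': ['16/12/2006'],
    'Time': ['17:24:00'],
    'Global_active_power': [4.216],   # kW, averaged over one minute
    'Sub_metering_1': [0.0],          # Wh
    'Sub_metering_2': [1.0],          # Wh
    'Sub_metering_3': [17.0],         # Wh
})

# Day-first dates need an explicit format, otherwise 01/02/2007 could be read as January 2nd
toy['Datetime'] = pd.to_datetime(toy['Date'] + ' ' + toy['Time'], format='%d/%m/%Y %H:%M:%S')
toy = toy.set_index('Datetime').drop(columns=['Date', 'Time'])

# kW -> Wh per minute: kW * 1000 W/kW * (1/60) h
total_wh = toy['Global_active_power'] * 1000 / 60
metered_wh = toy[['Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']].sum(axis=1)
toy['active_power_per_minute'] = total_wh - metered_wh

print(toy['active_power_per_minute'].round(2))
```

For this row the household uses about 70.27 Wh in the minute, of which 18 Wh is sub-metered, leaving about 52.27 Wh for everything else.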

## **2. EDA**
***

### **Visualize time series trends**
*** 

We will first view the time series trends for every column at the original one-minute resolution.

In [None]:
def visualize_time_trends(data,columns_to_omit=None,resample=None):
    num_plots = len(data.columns) - len(columns_to_omit) if columns_to_omit is not None else len(data.columns)
    num_rows = num_cols = math.ceil(math.sqrt(num_plots))
    
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(15,10))
    if columns_to_omit is  None:
        columns_to_omit = []
        
    i = j = 0
    
    for column in data.columns:
        if column not in columns_to_omit:
            if resample is None:
                axes[i, j].plot(data[column], label=column)
                
            else:
                axes[i, j].plot(data[column].resample(resample).mean(), label=column)
                
            axes[i, j].set_title(f'Time series for {column}')
            axes[i, j].set_xlabel('Datetime')
            axes[i, j].set_ylabel('value')
            axes[i, j].legend()
            j += 1
            
            if j == num_cols:
                j = 0
                i += 1
                
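    # hide every unused subplot and draw the grid on each used one
    # (plt.grid only affects the current axes, i.e. the last subplot)
    for idx, ax in enumerate(axes.flat):
        if idx >= num_plots:
            ax.axis('off')
        else:
            ax.grid(True)

    plt.tight_layout()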
    plt.show()

visualize_time_trends(power_consumption_df)

Let's resample the data to get a clearer visualization:

In [None]:
visualize_time_trends(data=power_consumption_df,resample='D')   # daily resampling

In [None]:
visualize_time_trends(data=power_consumption_df,resample='W')   # weekly resampling

In [None]:
visualize_time_trends(data=power_consumption_df,resample='ME')   # monthly resampling

In [None]:
visualize_time_trends(data=power_consumption_df,resample='YE')   # annual resampling
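
As a minimal sketch of what the resampling inside `visualize_time_trends` does, the example below averages synthetic minute-level readings into daily bins. The series `toy` and its values are assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd

# Two days of synthetic minute-level readings: day one is constant 1.0, day two is constant 3.0
idx = pd.date_range('2007-01-01', periods=2 * 24 * 60, freq='min')
toy = pd.Series(np.repeat([1.0, 3.0], 24 * 60), index=idx)

# resample groups the rows into calendar bins; mean() collapses each bin to one value
daily = toy.resample('D').mean()

print(len(toy), 'minute readings ->', len(daily), 'daily means')
print(daily)
```

Coarser bins smooth out short-term noise at the cost of hiding within-bin variation. Note that the `'ME'` and `'YE'` aliases used above require pandas 2.2 or newer; older releases spell them `'M'` and `'Y'`.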

### **Check for seasonality and cyclical patterns**
***

From the time series plots we can make the following observations:
1. **Global Active Power**

    * **Seasonality:** There appear to be strong seasonal patterns with regular peaks and troughs (most visible in the monthly resampling). This could be explained by higher power consumption during certain times of the year (for example, heating in cold winter months or cooling on hot summer days).

    * **Cyclical Patterns:** There may be some longer-term trends, but they are overshadowed by the clear seasonal patterns and are difficult to isolate without further analysis.

2. **Global Reactive Power**
    * **Seasonality:** As with Global Active Power, we can see a strong seasonal pattern. This makes sense: reactive power is the non-working component of the power drawn by the same household appliances, so it should follow the same seasonality as the active power.

    * **Cyclical Patterns:** As with the active power, longer-term trends are less apparent because of the strong seasonal pattern.

3. **Voltage**
    * **Seasonality:** The voltage shows less clear seasonal patterns than power consumption (it doesn't seem to have a constant period), though there are still some periodic fluctuations (clearer in the weekly or monthly resampling).

    * **Cyclical Patterns:** If there are any cyclical patterns in the voltage, they are not very prominent.

4. **Global Intensity**
    * **Seasonality:** There is a clear seasonal pattern with regular fluctuations, which again could be explained by periods of higher electricity demand.

    * **Cyclical Patterns:** As with Global Active Power, the cyclical trends are less evident due to the dominant seasonal patterns.

5. **Sub Metering 1, 2, and 3**
    * **Seasonality:** Each sub-metering shows distinct seasonal patterns, likely corresponding to specific appliances that have different regular usage cycles.

    * **Cyclical Patterns:** Less apparent, as the data is dominated by strong seasonal patterns.

6. **Active Power Per Minute**
    * **Seasonality:** There is a noticeable seasonal pattern with regular peaks and troughs.

    * **Cyclical Patterns:** Any cyclical trends are overshadowed by the strong seasonality.


**Conclusion**

This dataset exhibits strong seasonal patterns across most of the features, as well as in the target variable (active power per minute). The seasonal patterns are consistent and predictable, and most likely correspond to higher and lower power demand during different periods of the year.

Cyclical patterns in this dataset are harder to distinguish from the seasonal ones, because the seasonal patterns dominate.
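
One simple way to back up a visual reading of yearly seasonality is a month-of-year profile: average all Januaries together, all Februaries together, and so on. The sketch below does this on a synthetic series whose yearly peak in mid-January is an assumption chosen for the demo, not a finding about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a yearly cycle peaking in mid-January (an assumption for the demo)
idx = pd.date_range('2007-01-01', '2009-12-31', freq='D')
rng = np.random.default_rng(0)
day_of_year = idx.dayofyear.to_numpy()
values = 1.0 + 0.5 * np.cos(2 * np.pi * (day_of_year - 15) / 365.25) + rng.normal(0, 0.05, len(idx))
toy = pd.Series(values, index=idx)

# Group by calendar month across all years and average within each group
monthly_profile = toy.groupby(toy.index.month).mean()

print(monthly_profile.round(2))
print('peak month:', monthly_profile.idxmax(), 'trough month:', monthly_profile.idxmin())
```

A profile with a clear peak and trough that repeats in every year supports a seasonal reading; a flat profile would suggest that the apparent pattern comes from a trend or from noise instead.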

### **Analyze distribution of power consumption**
***

In [None]:
def visualize_distributions(data,columns_to_omit=None):
    num_plots = len(data.columns) - len(columns_to_omit) if columns_to_omit is not None else len(data.columns)
    num_rows = num_cols = math.ceil(math.sqrt(num_plots))
    
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(15,10))
    
    if columns_to_omit is  None:
        columns_to_omit = []
        
    i = j = 0
    
    for column in data.columns:
        if column not in columns_to_omit:
            sns.histplot(x=data[column], label=column,kde=True,ax=axes[i,j])
            axes[i, j].set_title(f'Distribution of {column}')
            axes[i, j].set_xlabel('Value')
            axes[i, j].set_ylabel("Count")
            axes[i,j].legend()
            j += 1
            
            if j == num_cols:
                j = 0
                i += 1
                
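    # hide every unused subplot and draw the grid on each used one
    # (plt.grid only affects the current axes, i.e. the last subplot)
    for idx, ax in enumerate(axes.flat):
        if idx >= num_plots:
            ax.axis('off')
        else:
            ax.grid(True)

    plt.tight_layout()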
    plt.show()

visualize_distributions(data=power_consumption_df)

From the distributions we can draw the following conclusions:
* Global Active Power appears to be a mixture of two roughly bell-shaped components, one centered near 0 and the other at approximately 1.7. That could reflect two consumption regimes within the year, one with lower power consumption and one with higher.

* Global Reactive Power is less spread out, with most values near 0. This suggests that the household's appliances draw little reactive power :) .

* The voltage is close to normally distributed, with a mean around 240 V (which makes sense for a French household). The spread around the mean could be explained by the changing load as different appliances switch on and off.

* Just like Global Active Power, Global Intensity also appears to consist of two roughly bell-shaped components, one centered at approximately 0 and one at approximately 5, which again can imply different seasonal patterns.

* Sub Metering 1 is concentrated near 0. It could imply that the appliances on this sub-meter consume little energy most of the time.
* Sub Metering 2 seems to have far fewer counts overall. It could imply that this column has missing values (we will check that in the next section).

* Sub Metering 3 is more spread out. Most of its values lie in peaks between 0 and 2, but it also has a decent amount of values between 15 and 20. These could be interpreted either as periods of high power consumption or as outliers.

* Active power per minute looks roughly bell-shaped, with a peak between 0 and 20 (at about 5).
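
Visual judgments such as "normally distributed" or "concentrated near 0" can be cross-checked with skewness and excess kurtosis. The sketch below uses synthetic stand-in columns; their names and parameters are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
samples = pd.DataFrame({
    'bell_shaped': rng.normal(loc=240, scale=3, size=50_000),     # stand-in for a Voltage-like column
    'piled_near_zero': rng.exponential(scale=1.0, size=50_000),   # stand-in for a sub-metering-like column
})

summary = pd.DataFrame({
    'skew': samples.skew(),             # close to 0 for a symmetric distribution
    'excess_kurtosis': samples.kurt(),  # close to 0 for a normal distribution
})
print(summary.round(2))
```

On the real data, the same two calls on `power_consumption_df` would give one row per column; a strongly positive skew matches a histogram that piles up near 0 with a long right tail.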

### **Identify and handle missing values and outliers**
***

First, let's check for missing values and where they lie (**Note:** we have already seen that missing values are marked as '?', and made pandas interpret them as NaN):

In [None]:
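display(power_consumption_df.isna().sum())
# every measurement column has the same number of gaps, so the first column is representative
missing_pct = power_consumption_df.isna().sum().iloc[0] / power_consumption_df.shape[0] * 100
display(Markdown(f'##### {missing_pct:.2f}% of the data is missing'))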

We can see that all the measurement columns have missing values, all with the same number of missing values (25979, which is approximately 1.25% of the data, just as the dataset documentation describes).
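
The following minimal sketch shows how the `'?'` marker becomes NaN at load time and how the missing share is computed. The four rows are made up for the example and only two measurement columns are included.

```python
import io
import pandas as pd

# Four made-up rows in the raw file's layout; the third row has '?' in every measurement column
raw = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.84\n"
    "16/12/2006;17:25:00;5.360;233.63\n"
    "16/12/2006;17:26:00;?;?\n"
    "16/12/2006;17:27:00;5.388;233.74\n"
)
toy = pd.read_csv(raw, sep=';', na_values='?')

missing_counts = toy.isna().sum()
missing_pct = toy.isna().mean() * 100   # mean of a boolean mask = fraction of True values
print(missing_counts)
print(missing_pct)
```

Because `'?'` is converted while parsing, the measurement columns load as floats rather than strings, which is what lets the later arithmetic on them work.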

Now let's check for outliers:

In [None]:
def visualize_boxplots(data,columns_to_omit=None):
    num_plots = len(data.columns) - len(columns_to_omit) if columns_to_omit is not None else len(data.columns)
    num_rows = num_cols = math.ceil(math.sqrt(num_plots))
    
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(15,10))
    
    if columns_to_omit is  None:
        columns_to_omit = []
        
    i = j = 0
    
    for column in data.columns:
        if column not in columns_to_omit:
            sns.boxplot(x=data[column], label=column,ax=axes[i,j])
            axes[i, j].set_title(f'Boxplot of {column}')
            axes[i, j].set_xlabel('Value')
            axes[i,j].legend()
            j += 1
            
            if j == num_cols:
                j = 0
                i += 1
                
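    # hide every unused subplot and draw the grid on each used one
    # (plt.grid only affects the current axes, i.e. the last subplot)
    for idx, ax in enumerate(axes.flat):
        if idx >= num_plots:
            ax.axis('off')
        else:
            ax.grid(True)

    plt.tight_layout()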
    plt.show()

visualize_boxplots(data=power_consumption_df)

From the boxplots, we can see that the dataset has a lot of outliers. However, they could be explained by different seasonal patterns of power consumption within the year. Therefore, we will not alter these outliers. On the contrary, we would like to use them to predict high / low power consumption in certain periods of the year.
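
The points a boxplot draws as outliers are, by default, those lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule follows; the helper name `iqr_outlier_mask` and the toy values are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

toy = pd.Series([1, 2, 3, 4, 5, 100])
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())
```

Applying the mask per column and taking `mask.mean()` gives the share of flagged points, which helps judge whether the "outliers" are rare glitches or a regular part of the consumption pattern.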

So, let's first handle the missing values. We will impute them using the forward fill method, which fills each missing value with the last observed value. Since our target variable is calculated from other columns, we will first remove it, then forward fill all the missing values, and only then re-calculate the target variable (active power per minute).

In [None]:
power_consumption_df_no_missing = power_consumption_df.copy()
power_consumption_df_no_missing.drop(columns=['active_power_per_minute'],inplace=True)

# forward fill every column at once
power_consumption_df_no_missing = power_consumption_df_no_missing.ffill()

# check for the existence of missing values again
power_consumption_df_no_missing.isna().sum()
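
As a minimal sketch of how forward fill behaves, including its one blind spot, consider this toy series (the values are made up for the example):

```python
import numpy as np
import pandas as pd

toy = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])
filled = toy.ffill()

print(filled.tolist())
print('still missing:', int(filled.isna().sum()))
```

Gaps in the middle are filled with the last observed value, but a gap at the very start has nothing before it to copy and stays missing, which is why re-checking `isna().sum()` after the fill is worthwhile. `ffill` also accepts a `limit` argument to cap how many consecutive values are filled.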

Now that we have no missing values, let's re-calculate our target variable (active_power_per_minute):

In [None]:
# same formula as before: total Wh per minute minus the three sub-metered Wh readings
power_consumption_df_no_missing['active_power_per_minute'] = (power_consumption_df_no_missing['Global_active_power'] * 1000 / 60) - (power_consumption_df_no_missing['Sub_metering_1'] + power_consumption_df_no_missing['Sub_metering_2'] + power_consumption_df_no_missing['Sub_metering_3'])
power_consumption_df_no_missing.head()