# Data Science
# Analyze temperature and humidity behavior for a day
# Dataset: DHT11 Temperature and Humidity Sensor 

**The objective of this kernel is to analyze how temperature and humidity behave over one day, using the DHT11 Temperature and Humidity Sensor dataset. The dataset, log_temp.log, is a log file produced by a Python script running on a Raspberry Pi 3 Model B.**

**Import libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

**Open the log file to inspect its structure and explore the first rows**

In [None]:
# Read the CSV file included in the dataset folder
df_csv = pd.read_csv("../input/dht11-temperature-and-humidity-sensor-1-day/log_temp.csv")
df_csv.head()

In [None]:
# Read the raw log with the read_csv defaults (comma separator, first line as header)
df = pd.read_csv("../input/dht11-temperature-and-humidity-sensor-1-day/log_temp.log")
df.head()

**Result: the file is not corrupted; it has no header and its fields are delimited by a single space (" ")**

**Open the log file as space-delimited, without a header**

In [None]:
df = pd.read_csv("../input/dht11-temperature-and-humidity-sensor-1-day/log_temp.log", sep=" ", header=None)
df.head()
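
The same parsing step can be tried on a few in-memory lines. This is only a sketch: the sample lines are hypothetical, and the `T=` / `H=` prefixes and the three-field `error` line are assumptions inferred from the slicing and NaN counts seen later in this notebook. It also passes `names=` to `read_csv` so the columns are labelled at read time.

```python
import io

import pandas as pd

# Hypothetical sample lines (not taken from the real log). The "T=" / "H="
# prefixes and the short "error" line are assumptions about the format.
sample_log = (
    "2019-03-15 06:00:01 T=15.0 H=60.0\n"
    "2019-03-15 07:00:01 error\n"
    "2019-03-15 08:00:01 T=33.0 H=30.0\n"
)

# sep=" " splits on single spaces, header=None keeps the first line as data,
# and names= labels the four columns in one step.
sample_df = pd.read_csv(
    io.StringIO(sample_log),
    sep=" ",
    header=None,
    names=["date", "hour", "temp", "humi"],
)
print(sample_df)
```

In this sketch the short line leaves "humi" empty (NaN) while "temp" holds the string "error".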

**Result: the file is parsed correctly.**

**Assign column names**

In [None]:
df.columns = ["date", "hour", "temp", "humi"]
df.head()

**Explore number of entries, NaN values and column data types**

In [None]:
df.info()

**"humi" has 16 NaN values, yet "temp" has none. That mismatch is suspicious, so the data needs a closer look**

In [None]:
df.describe(include="all")

**There are three different dates; however, the analysis should focus on a single date**

In [None]:
df["date"].value_counts()

**The date with the most records is 2019-03-15, so the analysis will focus on it**

In [None]:
df["temp"].value_counts()

**There are 16 records containing the string "error"; these records should be converted to NaN values**

In [None]:
df["humi"].value_counts()

**Select records of date 2019-03-15 only**

In [None]:
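# .copy() makes the selection an independent DataFrame, so later column
# assignments do not trigger SettingWithCopyWarning
df = df[df.date == "2019-03-15"].copy()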

**Verify the dataset**

In [None]:
df["date"].value_counts()
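
Rather than typing the date by hand, the busiest date could also be picked programmatically. A minimal sketch on a hypothetical four-row frame (the values are made up for illustration): `value_counts().idxmax()` returns the most frequent date, which is then used as the filter.

```python
import pandas as pd

# Hypothetical rows, only to illustrate the selection step.
demo = pd.DataFrame({
    "date": ["2019-03-14", "2019-03-15", "2019-03-15", "2019-03-16"],
    "temp": ["T=20.0", "T=21.0", "error", "T=19.0"],
})

# The index label of the largest count is the date with the most records.
busiest_date = demo["date"].value_counts().idxmax()

# Keep only that date; .copy() gives an independent DataFrame.
one_day = demo[demo["date"] == busiest_date].copy()
print(busiest_date, len(one_day))
```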

**Replace error records with NaN values**

In [None]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
df = df.replace("error", np.nan)

In [None]:
df.info()

**All NaN values will be replaced with the placeholder string "0000.0" so they can be identified later**

In [None]:
df = df.fillna("0000.0")

In [None]:
df.info()

**Keep only the hour (the first two characters) of the "hour" column**

In [None]:
df["hour"] = df["hour"].str.slice(stop=2)

**Extract the numeric value from the "temp" and "humi" columns (characters 2 to 5 of each string)**

In [None]:
df["temp"] = df["temp"].str.slice(start=2,stop=6)

In [None]:
df["humi"] = df["humi"].str.slice(start=2,stop=6)

In [None]:
df.head()
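
As an alternative to fixed-position slicing, the number could be pulled out with a regular expression, which does not depend on the field width. This is a sketch under the assumption that readings look like `T=15.0` / `H=60.0`; the sample values and the `pattern` name are hypothetical.

```python
import pandas as pd

# Hypothetical readings; "0000.0" is the placeholder used in this notebook.
readings = pd.DataFrame({
    "temp": ["T=15.0", "T=9.5", "0000.0"],
    "humi": ["H=60.0", "H=100.0", "0000.0"],
})

# Capture the trailing number (digits with an optional decimal part).
pattern = r"(\d+(?:\.\d+)?)$"

for column in ["temp", "humi"]:
    # expand=False returns a Series for a single capture group.
    readings[column] = (
        readings[column].str.extract(pattern, expand=False).astype(float)
    )

print(readings)
```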

**Because the "date" column now holds a single value, it is no longer needed**

In [None]:
# Name the axis explicitly; the positional form drop("date", 1) was removed in pandas 2.0
df = df.drop(columns="date")
df.head()

**Reset DataFrame index**

In [None]:
df.reset_index(drop=True,inplace=True)
df.head()

**Review column data types**

In [None]:
df.dtypes

**Assign the correct data type for each column in order to manage numeric values**

In [None]:
df.hour = df.hour.astype(int)
df.temp = df.temp.astype(float)
df.humi = df.humi.astype(float)

In [None]:
df.dtypes
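
For comparison, `pd.to_numeric` with `errors="coerce"` converts strings to numbers and turns anything unparseable into NaN in a single step. This is a sketch of an alternative route, not the approach used in this kernel; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw strings, including an "error" entry and a missing value.
raw = pd.Series(["22.0", "23.5", "error", None])

# errors="coerce" maps values that cannot be parsed to NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
```

With this route the missing readings stay as NaN, which `mean()` skips by default, so no placeholder value is needed.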

**At this point the dataset has 00.0 values as substitutes for NaN values; let's plot it to visualize the data and inform later decisions**

In [None]:
df.groupby("hour")["temp"].mean().plot(kind="line",color="blue")
df.groupby("hour")["humi"].mean().plot(kind="line",color="orange")

**The 00.0 values should be replaced, but a decision must be made about how. In my opinion, each 00.0 value should be replaced with the previous valid reading in the same column; the column mean is used only when the very first value is itself 00.0. Review the code comments for more information**

In [None]:
# Columns to review
columns = ["temp", "humi"]
# Placeholder value that marks a missing reading
flag = 0.0

for each in columns:
    # Seed the fill value: if the first reading is a placeholder, fall back to
    # the mean of the valid readings (placeholders excluded so they do not bias
    # the mean); otherwise use the first reading itself
    if df[each].iloc[0] == flag:
        temp_t = df.loc[df[each] != flag, each].mean()
    else:
        temp_t = df[each].iloc[0]
    # Replace each placeholder with the last valid reading; otherwise remember
    # the current reading as the new fill value
    for index, row in df.iterrows():
        if row[each] == flag:
            df.loc[index, each] = temp_t
        else:
            temp_t = row[each]

In [None]:
df.describe()
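
The replacement rule described above is a forward fill, so it could also be written without an explicit loop. A minimal sketch on a hypothetical frame: the placeholders are turned back into NaN, filled from the previous valid reading with `ffill()`, and a leading gap falls back to the mean. Note that the fallback mean here is computed over the valid readings only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 0.0 marks a missing reading, as in this notebook.
demo = pd.DataFrame({
    "temp": [0.0, 20.0, 0.0, 22.0],
    "humi": [50.0, 0.0, 0.0, 44.0],
})

for column in ["temp", "humi"]:
    # Turn the placeholders back into NaN so pandas treats them as missing.
    valid = demo[column].replace(0.0, np.nan)
    # Forward-fill from the previous valid reading; a leading gap has no
    # predecessor, so it falls back to the mean of the valid readings.
    demo[column] = valid.ffill().fillna(valid.mean())

print(demo)
```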

**Now it is time to visualize the data**

In [None]:
df.groupby("hour")["temp"].mean().plot(kind="line",color="blue")
df.groupby("hour")["humi"].mean().plot(kind="line",color="orange")

xint2 = np.arange(df["hour"].min(), df["hour"].max()+1, 2)
plt.xticks(xint2)
yint2 = np.arange(df["temp"].min()-2, df["temp"].max()+2, 2)
plt.yticks(yint2)

plt.grid()
plt.title("2019-03-15")
plt.legend(("Temperature","Humidity"))
plt.xlabel("Hour")
plt.ylabel("°C / RH")
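
The hours at which the extremes occur can also be read off numerically instead of from the plot. A sketch on hypothetical readings (not the real dataset): the hourly means are computed with `groupby`, and `idxmin()` / `idxmax()` return the hour labels of the lowest and highest means.

```python
import pandas as pd

# Hypothetical readings, only to illustrate the calculation.
demo = pd.DataFrame({
    "hour": [6, 6, 7, 8, 8],
    "temp": [15.0, 16.0, 24.0, 32.0, 34.0],
    "humi": [60.0, 62.0, 50.0, 31.0, 29.0],
})

# Mean temperature and humidity for each hour of the day.
hourly = demo.groupby("hour")[["temp", "humi"]].mean()

coldest_hour = hourly["temp"].idxmin()  # hour with the lowest mean temperature
warmest_hour = hourly["temp"].idxmax()  # hour with the highest mean temperature
driest_hour = hourly["humi"].idxmin()   # hour with the lowest mean humidity

print(hourly)
print(coldest_hour, warmest_hour, driest_hour)
```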

### Results

**On March 15, 2019, somewhere in the world, the temperature ranged between 15°C and 33°C. The minimum occurred at 6 am and the maximum at 8 am. The graph shows temperature and humidity as hourly means.
One important fact is that the temperature rose by approximately 18°C in two hours, from 6 am to 8 am. The minimum humidity occurred at 8 am, coinciding with the maximum temperature.
This dataset has a small number of records; however, the logging frequency could be increased to collect more. A machine learning analysis could be developed once more data is available.
Thanks!**