# Weather Dataset - Temperature Prediction

## _Description_: 

- **Formatted Date**: Timestamp in `yyyy-mm-dd hh:mm:ss.SSS +hhmm` form (24-hour clock with UTC offset), e.g. `2012-04-01 00:00:00.000 +0200`.
- **Summary**: Summary of weather.
- **Precip Type**: Type of precipitation.
- **Temperature**: Temperature in degrees Centigrade.
- **Apparent Temperature ©**: Apparent temperature in degrees Centigrade.
- **Humidity**: Humidity at the recorded time.
- **Wind Speed**: Wind speed in km/h.
- **Wind Bearing**: Wind Bearing in degrees.
- **Visibility**: Visibility in km.
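
The sample rows shown later in this notebook carry timestamps such as `2012-04-01 00:00:00.000 +0200`. As a minimal sketch (not part of the original notebook), such strings could be parsed into timezone-aware datetimes like this. The explicit `format` string and the choice to normalise to UTC are assumptions made for illustration.

```python
import pandas as pd

# Two sample timestamps copied from the "Formatted Date" column
raw = pd.Series([
    "2012-04-01 00:00:00.000 +0200",
    "2012-04-01 01:00:00.000 +0200",
])

# %z consumes the "+0200" offset; utc=True converts every value to UTC
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f %z", utc=True)

# Hour of day in UTC, usable as a numeric feature
hours = parsed.dt.hour.tolist()
print(hours)
```

Converting to UTC avoids mixed offsets in one column if the data spans a daylight-saving change.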


In [1]:
# Importing the necessary modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

In [2]:
# Importing dataset into data frame variable
df = pd.read_csv("WeatherHistoryDataset.csv")

# Printing the first 5 rows of the dataset
df.head()

| | Formatted Date | Summary | Precip Type | Temperature (C) | Apparent Temperature © | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | Daily Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-04-01 00:00:00.000 +0200 | Breezy and Overcast | rain | 9.444444 | 5.511111111 | 0.52 | 35.42 | 340 | 16.1 | 0 | 1002.8 | Partly cloudy until evening and breezy in the ... |
| 1 | 2012-04-01 01:00:00.000 +0200 | Mostly Cloudy | rain | 8.333333 | 5.194444444 | 0.45 | 20.93 | 320 | 16.1 | 0 | 1004.1 | Partly cloudy until evening and breezy in the ... |
| 2 | 2012-04-01 02:00:00.000 +0200 | Breezy and Mostly Cloudy | rain | 6.855556 | 2.244444444 | 0.54 | 33.2304 | 322 | 15.1501 | 0 | 1004.97 | Partly cloudy until evening and breezy in the ... |
| 3 | 2012-04-01 03:00:00.000 +0200 | Mostly Cloudy | rain | 6.111111 | 1.888888889 | 0.57 | 25.76 | 310 | 16.1 | 0 | 1005.9 | Partly cloudy until evening and breezy in the ... |
| 4 | 2012-04-01 04:00:00.000 +0200 | Breezy and Overcast | rain | 6.111111 | 1.605555556 | 0.51 | 28.98 | 310 | 16.1 | 0 | 1006.0 | Partly cloudy until evening and breezy in the ... |


In [3]:
# Summary statistics for the columns pandas currently treats as numeric
df.describe()

| | Temperature (C) | Loud Cover |
|---|---|---|
| count | 35077.0 | 35077.0 |
| mean | 12.190872 | 0.0 |
| std | 9.549309 | 0.0 |
| min | -21.822222 | 0.0 |
| 25% | 4.911111 | 0.0 |
| 50% | 12.15 | 0.0 |
| 75% | 18.894444 | 0.0 |
| max | 38.861111 | 0.0 |


In [4]:
# Missing values are stored as " " (a single space) in the CSV; convert them to NaN so pandas can detect and clean them
df = df.replace(" ", np.nan)
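
The `describe()` output above lists only two columns. A likely reason (an assumption, since the notebook does not show `df.dtypes`) is that the `" "` placeholders caused the other measurement columns to be read as text. The sketch below uses a small made-up frame, `toy`, to show how replacing the placeholder and then calling `pd.to_numeric` yields a proper float column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: " " marks a missing humidity reading
toy = pd.DataFrame({
    "Temperature (C)": [9.4, 8.3, 6.9],
    "Humidity": ["0.52", " ", "0.54"],
})

# Humidity holds strings, so describe() would skip it at this point
print(toy.dtypes)

# Same replacement as in the cell above, followed by a numeric conversion
toy = toy.replace(" ", np.nan)
toy["Humidity"] = pd.to_numeric(toy["Humidity"])

# Both columns are now numeric and appear in the summary
summary = toy.describe()
print(summary.columns.tolist())
```

In the real dataset the same conversion would have to be applied to each affected column; which columns those are is not confirmed by the notebook.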

In [5]:
# Counting the number of NaN values in each column
df.isnull().sum()

Formatted Date              0
Summary                   203
Precip Type               144
Temperature (C)             0
Apparent Temperature ©    199
Humidity                  144
Wind Speed (km/h)         260
Wind Bearing (degrees)    178
Visibility (km)           253
Loud Cover                  0
Pressure (millibars)      159
Daily Summary               0
dtype: int64

## Data Cleaning:

In [6]:
# Removing rows where Precip Type or Summary is NaN
df = df[df["Precip Type"].notna()]
df = df[df["Summary"].notna()]
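
The two filters above can also be written as a single `dropna` call. The following sketch checks on a small made-up frame (`toy`, hypothetical values) that both forms keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Summary and one missing Precip Type
toy = pd.DataFrame({
    "Summary": ["Mostly Cloudy", np.nan, "Overcast", "Clear"],
    "Precip Type": ["rain", "rain", np.nan, "snow"],
    "Humidity": [0.45, 0.52, np.nan, 0.60],
})

# Chained boolean filters, as in the cell above
chained = toy[toy["Precip Type"].notna()]
chained = chained[chained["Summary"].notna()]

# Equivalent single call: drop rows with NaN in either listed column
single = toy.dropna(subset=["Precip Type", "Summary"])

print(single.equals(chained))
```

`dropna(subset=...)` only inspects the listed columns, so NaN values elsewhere (here in `Humidity`) do not cause a row to be dropped.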

In [7]:
# Dropping the Loud Cover column: it is 0 in every row (see describe() above), so it carries no information
df = df.drop(columns="Loud Cover")

In [8]:
# Check: Summary and Precip Type should no longer contain NaN values
df.isnull().sum()

Formatted Date              0
Summary                     0
Precip Type                 0
Temperature (C)             0
Apparent Temperature ©    197
Humidity                  144
Wind Speed (km/h)         259
Wind Bearing (degrees)    175
Visibility (km)           252
Pressure (millibars)      159
Daily Summary               0
dtype: int64

In [9]:
# Imputing the remaining NaN values with the column mean.
# Columns 4 to 9 are Apparent Temperature through Pressure (millibars).

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df.iloc[:, 4:10] = imputer.fit_transform(df.iloc[:, 4:10].values)
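
To make the effect of mean imputation concrete, here is a small sketch on made-up data. It swaps `SimpleImputer` for pandas' own `fillna` with the column means, which gives the same result for this strategy. It also selects columns by name rather than by position, an alternative that keeps working if the column order changes. The frame `toy` and the list `numeric_cols` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "Humidity": [0.50, np.nan, 0.70],
    "Wind Speed (km/h)": [20.0, 30.0, np.nan],
})
numeric_cols = ["Humidity", "Wind Speed (km/h)"]

# mean() ignores NaN, so each gap is filled with the mean of the observed values
col_means = toy[numeric_cols].mean()
toy[numeric_cols] = toy[numeric_cols].fillna(col_means)

print(toy)
```

The missing humidity becomes the mean of 0.50 and 0.70, and the missing wind speed becomes the mean of 20.0 and 30.0.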

In [10]:
# Check: no NaN values should remain in any column
df.isnull().sum()

Formatted Date            0
Summary                   0
Precip Type               0
Temperature (C)           0
Apparent Temperature ©    0
Humidity                  0
Wind Speed (km/h)         0
Wind Bearing (degrees)    0
Visibility (km)           0
Pressure (millibars)      0
Daily Summary             0
dtype: int64