# Weather Dataset - Temperature Prediction

## _Description_: 

- **Formatted Date**: Timestamp in `yyyy-mm-dd hh:mm:ss.SSS ±hhmm` form (24-hour clock with a UTC offset).
- **Summary**: Summary of weather.
- **Precip Type**: Type of precipitation.
- **Temperature**: Temperature in degrees Centigrade.
- **Apparent Temperature ©**: Apparent temperature in degrees Centigrade.
- **Humidity**: Humidity at the recorded time, as a fraction (for example, 0.52).
- **Wind Speed**: Wind speed in km/h.
- **Wind Bearing**: Wind bearing in degrees.
- **Visibility**: Visibility in km.
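
The `Formatted Date` strings carry a UTC offset, so they can be parsed into timezone-aware timestamps. The snippet below is a minimal sketch, assuming the values look like the two rows shown in the dataset preview; the variable names `stamps` and `parsed` are illustrative only.

```python
import pandas as pd

# Two timestamps in the same layout as the dataset's "Formatted Date" column
stamps = pd.Series([
    "2012-04-01 00:00:00.000 +0200",
    "2012-04-01 01:00:00.000 +0200",
])

# utc=True converts the offset-aware strings to one UTC-based datetime dtype
parsed = pd.to_datetime(stamps, utc=True)

# The .dt accessor then exposes calendar fields such as the (UTC) hour
print(parsed.dt.hour.tolist())
```

Converting to UTC avoids mixed offsets in one column, at the cost of shifting the local hour by the offset.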


In [3]:
# Importing the necessary modules
import statistics 
import scipy.stats  # import the submodule explicitly so scipy.stats.norm is available
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

In [4]:
# Importing dataset into data frame variable
df = pd.read_csv("WeatherHistoryDataset.csv")

# Displaying the dataset (pandas shows the first and last rows)
df

Unnamed: 0.1,Unnamed: 0,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature ©,Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,0,2012-04-01 00:00:00.000 +0200,Breezy and Overcast,rain,9.444444,5.511111111,0.52,35.42,340,16.1,0,1002.8,Partly cloudy until evening and breezy in the ...
1,1,2012-04-01 01:00:00.000 +0200,Mostly Cloudy,rain,8.333333,5.194444444,0.45,20.93,320,16.1,0,1004.1,Partly cloudy until evening and breezy in the ...
2,2,2012-04-01 02:00:00.000 +0200,Breezy and Mostly Cloudy,rain,6.855556,2.244444444,0.54,33.2304,322,15.1501,0,1004.97,Partly cloudy until evening and breezy in the ...
3,3,2012-04-01 03:00:00.000 +0200,Mostly Cloudy,rain,6.111111,1.888888889,0.57,25.76,310,16.1,0,1005.9,Partly cloudy until evening and breezy in the ...
4,4,2012-04-01 04:00:00.000 +0200,Breezy and Overcast,rain,6.111111,1.6055555560000000,0.51,28.98,310,16.1,0,1006.0,Partly cloudy until evening and breezy in the ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
35072,35072,2015-09-09 19:00:00.000 +0200,Partly Cloudy,rain,16.011111,16.01111111,0.46,0.1288,340,16.1,0,1019.23,Partly cloudy starting in the morning continui...
35073,35073,2015-09-09 20:00:00.000 +0200,Partly Cloudy,rain,13.061111,13.06111111,0.55,6.9552,331,15.5526,0,1019.81,Partly cloudy starting in the morning continui...
35074,35074,2015-09-09 21:00:00.000 +0200,Partly Cloudy,rain,11.161111,11.16111111,0.67,7.9212,352,16.1,0,1020.33,Partly cloudy starting in the morning continui...
35075,35075,2015-09-09 22:00:00.000 +0200,Clear,rain,10.583333,10.58333333,0.69,6.4239,342,16.1,0,1019.75,Partly cloudy starting in the morning continui...


In [5]:
# Summary statistics for the columns that pandas parsed as numeric
df.describe()

Unnamed: 0.1,Unnamed: 0,Temperature (C),Loud Cover
count,35077.0,35077.0,35077.0
mean,17538.0,12.190872,0.0
std,10126.002033,9.549309,0.0
min,0.0,-21.822222,0.0
25%,8769.0,4.911111,0.0
50%,17538.0,12.15,0.0
75%,26307.0,18.894444,0.0
max,35076.0,38.861111,0.0


In [6]:
# Missing values are stored as a single space " "; convert them to NaN so that pandas can detect and clean them
df = df.replace(" ", np.nan)
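
Replacing `" "` with NaN marks the missing entries, but a column that was read as text stays non-numeric until it is converted. The sketch below illustrates this on a toy frame (the frame `toy` and its values are made up for illustration); `pd.to_numeric` is one way to perform the conversion.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw file: numbers stored as text, blanks as " "
toy = pd.DataFrame({
    "Humidity": ["0.52", " ", "0.54"],
    "Summary": ["Clear", " ", "Overcast"],
})

# Same step as in the notebook: mark blanks as missing
toy = toy.replace(" ", np.nan)
print(toy.dtypes["Humidity"])  # still a text/object dtype at this point

# Convert the measurement column to floats so numeric methods work on it
toy["Humidity"] = pd.to_numeric(toy["Humidity"], errors="coerce")
print(toy["Humidity"].mean())
```

This also explains why `describe()` lists only a few columns: it summarises numeric columns by default.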

In [7]:
# Counting the number of NaN values through each column
df.isnull().sum()

Unnamed: 0                  0
Formatted Date              0
Summary                   203
Precip Type               144
Temperature (C)             0
Apparent Temperature ©    199
Humidity                  144
Wind Speed (km/h)         260
Wind Bearing (degrees)    178
Visibility (km)           253
Loud Cover                  0
Pressure (millibars)      159
Daily Summary               0
dtype: int64

## Data Cleaning:

In [8]:
# Removing rows where Precip Type or Summary is NaN
df = df[df["Precip Type"].notna()]
df = df[df["Summary"].notna()]
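
The same row filter can be written in one call with `DataFrame.dropna` and its `subset` argument. This is a small sketch on made-up data, not the notebook's own cell:

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing Summary, one missing Precip Type, one missing Humidity
toy = pd.DataFrame({
    "Summary": ["Clear", np.nan, "Overcast", "Foggy"],
    "Precip Type": ["rain", "rain", np.nan, "snow"],
    "Humidity": [0.52, 0.45, 0.54, np.nan],
})

# Drop a row if either listed column is missing; NaNs in other columns are kept
cleaned = toy.dropna(subset=["Precip Type", "Summary"])
print(cleaned)
```

Rows with missing values only in other columns (here `Humidity`) survive, which matches the later step that imputes them instead.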

In [9]:
# Dropping the Loud Cover column: every value is 0 (see the describe() output), so it carries no information
df = df.drop(columns='Loud Cover')

In [10]:
# Check: remaining NaN counts per column
df.isnull().sum()

Unnamed: 0                  0
Formatted Date              0
Summary                     0
Precip Type                 0
Temperature (C)             0
Apparent Temperature ©    197
Humidity                  144
Wind Speed (km/h)         259
Wind Bearing (degrees)    175
Visibility (km)           252
Pressure (millibars)      159
Daily Summary               0
dtype: int64

In [11]:
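# Imputing the NaN values in columns 4 to 9 (Temperature (C) through Visibility (km)) with the column mean.
# Pressure (millibars) sits at position 10, outside this slice, so it is not imputed here.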

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df.iloc[:,4:10].values)
df.iloc[:,4:10] = imputer.transform(df.iloc[:,4:10].values)

In [12]:
# Check: only Pressure (millibars), which lies outside the imputed slice, still has NaNs
df.isnull().sum()

Unnamed: 0                  0
Formatted Date              0
Summary                     0
Precip Type                 0
Temperature (C)             0
Apparent Temperature ©      0
Humidity                    0
Wind Speed (km/h)           0
Wind Bearing (degrees)      0
Visibility (km)             0
Pressure (millibars)      159
Daily Summary               0
dtype: int64
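
Selecting columns by position makes it easy to miss one, as the remaining 159 NaNs in `Pressure (millibars)` show. A hedged alternative is to list the columns by name. The sketch below uses plain pandas `fillna` with the column means, which mirrors the `mean` strategy rather than calling `SimpleImputer`; the frame and the list `numeric_cols` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap in each column
toy = pd.DataFrame({
    "Humidity": [0.50, np.nan, 0.70],
    "Pressure (millibars)": [1000.0, 1010.0, np.nan],
})

# Hypothetical column list: name every column to impute explicitly
numeric_cols = ["Humidity", "Pressure (millibars)"]

# Fill each column's NaNs with that column's own mean
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
print(toy)
```

Naming the columns keeps the step correct even if columns are added, dropped, or reordered earlier in the notebook.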

## Testing of Hypothesis

In [16]:
# 1. For Column Temperature (C)

#Find sample mean and sample standard deviation
number_of_values=len(df) #no of values in the column
Sample_data=df["Temperature (C)"]

sample_mean=statistics.mean(Sample_data) 
sample_sd=statistics.stdev(Sample_data) 

#Hypothesis
#Null Hypothesis
# H0: The mean Temperature is 12 degrees C (mu=12)

#Alternate Hypothesis
# H1: The mean Temperature is not equal to 12 degrees C (mu != 12)

population_mean_from_hypothesis=12

#determining if the test is one tailed or two tailed and allotting the alpha value (0.025 per tail for a two tailed test)

test="two_sided_test"
if(test=="two_sided_test"):
    number_of_tails=2
    alpha = 0.025
else:
    number_of_tails=1
    alpha =0.05

# Z score
z_score=(sample_mean-population_mean_from_hypothesis)/(sample_sd/np.sqrt(number_of_values))
# One-tail p value; abs() makes a deviation in either direction count for the two tailed test
# (a one tailed test would use the signed z_score instead)
p_value = scipy.stats.norm.sf(abs(z_score)) 

if p_value > alpha:
    print('Failed to reject the null hypothesis')

else:
    print('Null hypothesis rejected and alternate hypothesis is accepted')

print('z_score=%.10f' % (z_score))
print('p_value=%.10f' % (p_value))
print('sample_mean=%.10f' % (sample_mean))
print('sample_sd=%.10f' % (sample_sd))
print('The test is '+str(number_of_tails)+' tailed test')

Null hypothesis rejected and alternate hypothesis is accepted
z_score=3.5786121490
p_value=0.0001727118
sample_mean=12.1836710080
sample_sd=9.5649922267
The test is 2 tailed test


In [19]:
# 2. For Column Apparent Temperature ©
#Find sample mean and sample standard deviation
number_of_values=len(df) #no of values in the column
Sample_data=df["Apparent Temperature ©"]

sample_mean=statistics.mean(Sample_data) 
sample_sd=statistics.stdev(Sample_data) 

#Hypothesis
#Null Hypothesis
# H0: The mean Apparent Temperature © is 11.5 degrees C (mu=11.5)

#Alternate Hypothesis
# H1: The mean Apparent Temperature © is not equal to 11.5 degrees C (mu != 11.5)

population_mean_from_hypothesis=11.5

#determining if the test is one tailed or two tailed and allotting the alpha value (0.025 per tail for a two tailed test)

test="two_sided_test"
if(test=="two_sided_test"):
    number_of_tails=2
    alpha = 0.025
else:
    number_of_tails=1
    alpha =0.05

# Z score
z_score=(sample_mean-population_mean_from_hypothesis)/(sample_sd/np.sqrt(number_of_values))
# One-tail p value; abs() is needed here because z_score is negative, and
# norm.sf of a negative z would return a value close to 1 and hide the deviation
p_value = scipy.stats.norm.sf(abs(z_score)) 

if p_value > alpha:
    print('Failed to reject the null hypothesis')

else:
    print('Null hypothesis rejected and alternate hypothesis is accepted')

print('z_score=%.10f' % (z_score))
print('p_value=%.10f' % (p_value))
print('sample_mean=%.10f' % (sample_mean))
print('sample_sd=%.10f' % (sample_sd))
print('The test is '+str(number_of_tails)+' tailed test')

Null hypothesis rejected and alternate hypothesis is accepted
z_score=-6.4732208071
p_value=0.0000000000
sample_mean=11.1289711267
sample_sd=10.6818365960
The test is 2 tailed test
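
Both test cells follow the same recipe: compute z = (sample mean - hypothesised mean) / (sample sd / sqrt(n)) and turn it into a tail probability. The helper below is a minimal sketch of a two-sided version, assuming a large sample so that the normal approximation holds. It doubles the one-tail probability of |z| and compares it with an overall significance level of 0.05, which is equivalent to comparing the one-tail value with 0.025. The function name `two_sided_z_test` and the example numbers are hypothetical, not taken from the dataset.

```python
import math
from scipy.stats import norm

def two_sided_z_test(sample_mean, sample_sd, n, mu0, alpha=0.05):
    """Return (z, p, reject) for H0: mu == mu0 against H1: mu != mu0."""
    z = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
    # Two-sided p value: probability of a deviation at least this large
    # in either direction, hence abs() and the factor of 2
    p = 2 * norm.sf(abs(z))
    return z, p, bool(p < alpha)

# Illustrative numbers only: the same deviation above and below mu0
z, p, reject = two_sided_z_test(10.5, 2.0, 100, 10.0)
z_neg, p_neg, reject_neg = two_sided_z_test(9.5, 2.0, 100, 10.0)
print(z, p, reject)
print(z_neg, p_neg, reject_neg)
```

Because of the `abs()`, a sample mean below the hypothesised value gives the same p value as an equally large deviation above it.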
