# Weather Dataset - Temperature Prediction

## _Description_: 

- **Formatted Date**: Timestamp in `yyyy-mm-dd hh:mm:ss.SSS ±hhmm` format (24-hour clock with UTC offset).
- **Summary**: Summary of weather.
- **Precip Type**: Type of precipitation.
- **Temperature**: Temperature in degrees Centigrade.
- **Apparent Temperature ©**: Apparent temperature in degrees Centigrade.
- **Humidity**: Humidity at recorded time.
- **Wind Speed**: Wind speed in km/h.
- **Wind Bearing**: Wind Bearing in degrees.
- **Visibility**: Visibility in km.
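
The timestamp layout described above can be parsed into timezone-aware values before any time-based analysis. The following is a minimal sketch on two hand-written sample strings, not part of the original notebook; the explicit `format` string is an assumption based on the rows shown in the dataset preview.

```python
import pandas as pd

# Two sample timestamps in the same layout as the Formatted Date column
raw = pd.Series([
    "2012-04-01 00:00:00.000 +0200",
    "2012-04-01 01:00:00.000 +0200",
])

# utc=True normalises the offset-aware strings onto a single UTC timeline
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f %z", utc=True)

# Midnight and 01:00 at +02:00 fall on the previous evening in UTC
hours_utc = parsed.dt.hour.tolist()
print(hours_utc)
```

Converting to UTC is one possible choice; keeping the local offset may be preferable if the analysis depends on local time of day.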


In [41]:
# Importing the necessary modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
import statistics
# Import the stats submodule explicitly; `import scipy` alone may not load it
import scipy.stats


In [42]:
# Importing dataset into data frame variable
df = pd.read_csv("WeatherHistoryDataset.csv")

#printing the first 5 rows of the dataset
df.head()

Unnamed: 0,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature ©,Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,2012-04-01 00:00:00.000 +0200,Breezy and Overcast,rain,9.444444,5.511111111,0.52,35.42,340,16.1,0,1002.8,Partly cloudy until evening and breezy in the ...
1,2012-04-01 01:00:00.000 +0200,Mostly Cloudy,rain,8.333333,5.194444444,0.45,20.93,320,16.1,0,1004.1,Partly cloudy until evening and breezy in the ...
2,2012-04-01 02:00:00.000 +0200,Breezy and Mostly Cloudy,rain,6.855556,2.244444444,0.54,33.2304,322,15.1501,0,1004.97,Partly cloudy until evening and breezy in the ...
3,2012-04-01 03:00:00.000 +0200,Mostly Cloudy,rain,6.111111,1.888888889,0.57,25.76,310,16.1,0,1005.9,Partly cloudy until evening and breezy in the ...
4,2012-04-01 04:00:00.000 +0200,Breezy and Overcast,rain,6.111111,1.605555556,0.51,28.98,310,16.1,0,1006.0,Partly cloudy until evening and breezy in the ...


In [43]:
# Finding out the general information about the dataset
df.describe()

Unnamed: 0,Temperature (C),Loud Cover
count,35077.0,35077.0
mean,12.190872,0.0
std,9.549309,0.0
min,-21.822222,0.0
25%,4.911111,0.0
50%,12.15,0.0
75%,18.894444,0.0
max,38.861111,0.0


In [44]:
# Missing values are stored as a single space (" "), so convert them to NaN before cleaning the data
df = df.replace(" ", np.nan)

In [45]:
# Counting the number of NaN values in each column
df.isnull().sum()

Formatted Date              0
Summary                   203
Precip Type               144
Temperature (C)             0
Apparent Temperature ©    199
Humidity                  144
Wind Speed (km/h)         260
Wind Bearing (degrees)    178
Visibility (km)           253
Loud Cover                  0
Pressure (millibars)      159
Daily Summary               0
dtype: int64
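
The exact-match replacement of `" "` only catches cells that contain a single space. As a hedged alternative, a regular expression can treat any empty or whitespace-only cell as missing. The toy frame below is invented for illustration and does not come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: missing values appear as " ", "" and "  "
toy = pd.DataFrame({
    "Humidity": ["0.52", " ", "0.54"],
    "Summary": ["Overcast", "", "  "],
})

# ^\s*$ matches empty strings and strings made only of whitespace
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Per-column count of missing values, as in the cell above
missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

Whether this broader pattern is needed depends on how the CSV actually encodes missing values, which is worth checking before relying on it.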

## Data Cleaning:

In [46]:
# Removing rows where Precip Type or Summary is NaN
df = df[df["Precip Type"].notna()]
df = df[df["Summary"].notna()]

In [47]:
# Dropping the Loud Cover column as it does not contain useful information
df.drop('Loud Cover', inplace=True, axis=1)

In [48]:
# Preparing the label encoder
le = preprocessing.LabelEncoder()

# Encoding Precip Type and Summary as integer codes so that they can be used in the analysis.
# Note: le is refit on each column, so after this cell it holds only the Summary classes.
df["Precip Type"] = le.fit_transform(df["Precip Type"])
df["Summary"] = le.fit_transform(df["Summary"])
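
Because a single `LabelEncoder` is refit for each column, the mapping for the first column is lost. One possible variation, sketched here on an invented toy frame, keeps one encoder per column in a dictionary (the name `encoders` is hypothetical) so that codes can be decoded later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with the same two categorical columns
toy = pd.DataFrame({
    "Precip Type": ["rain", "snow", "rain"],
    "Summary": ["Overcast", "Clear", "Overcast"],
})

# One fitted encoder per column, so each mapping stays available
encoders = {}
for col in ["Precip Type", "Summary"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Integer codes can be mapped back to the original labels
decoded = encoders["Precip Type"].inverse_transform(toy["Precip Type"])
print(list(decoded))
```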

In [49]:
# Imputing the remaining NaN values with the column mean
# (columns 4 to 9: Apparent Temperature © through Pressure (millibars))

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df.iloc[:,4:10].values)
df.iloc[:,4:10] = imputer.transform(df.iloc[:,4:10].values)
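
Positional slices such as `iloc[:, 4:10]` silently shift if a column is added or dropped earlier in the notebook. A minimal sketch of the same mean imputation selected by column name follows; the toy values and the `numeric_cols` list are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Humidity": [0.5, np.nan, 0.7],
    "Visibility (km)": [16.0, 14.0, np.nan],
})

# Select the columns to impute by name rather than by position
numeric_cols = ["Humidity", "Visibility (km)"]

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# fit_transform learns each column mean and fills the gaps in one step
toy[numeric_cols] = imputer.fit_transform(toy[numeric_cols])
print(toy)
```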

## Testing of Hypothesis

In [50]:
# 1. For Column Temperature (C)

#Find sample mean and sample standard deviation
number_of_values=len(df) #no of values in the column
Sample_data=df["Temperature (C)"]

sample_mean=statistics.mean(Sample_data) 
sample_sd=statistics.stdev(Sample_data) 

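# Hypotheses
# Null hypothesis
# H0: The mean temperature is 12 degrees C (mu = 12)

# Alternative hypothesis
# H1: The mean temperature is not equal to 12 degrees C (mu != 12)

population_mean_from_hypothesis=12

# Determining whether the test is one-tailed or two-tailed and allotting the alpha value.
# For a two-tailed test the 0.05 significance level is split into 0.025 per tail.

test="two_sided_test"
if(test=="two_sided_test"):
    number_of_tails=2
    alpha = 0.025
else:
    number_of_tails=1
    alpha =0.05

# Z score
z_score=(sample_mean-population_mean_from_hypothesis)/(sample_sd/np.sqrt(number_of_values))
# p value: for the two-tailed test, the tail area beyond |z| (compared with alpha per tail),
# so that negative z scores are handled correctly; for the one-tailed test, the upper tail area
if number_of_tails == 2:
    p_value = scipy.stats.norm.sf(abs(z_score))
else:
    p_value = scipy.stats.norm.sf(z_score)

if p_value > alpha:
    print('Failed to reject the null hypothesis')

else:
    print('Null hypothesis is rejected and alternate hypothesis is accepted')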

print('z_score=%.10f' % (z_score))
print('p_value=%.10f' % (p_value))
print('sample_mean=%.10f' % (sample_mean))
print('sample_sd=%.10f' % (sample_sd))
print('The test is '+str(number_of_tails)+' tailed test')


Null hypothesis is rejected and alternate hypothesis is accepted
z_score=3.5786121490
p_value=0.0001727118
sample_mean=12.1836710080
sample_sd=9.5649922267
The test is 2 tailed test
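
The steps in the cell above can also be wrapped in a small reusable function. The sketch below uses the common convention of doubling the tail area beyond |z| and comparing it with the full significance level; the function name `two_sided_z_test` and the synthetic sample are hypothetical and not taken from the notebook.

```python
import numpy as np
from scipy import stats

def two_sided_z_test(sample, mu0, alpha=0.05):
    """Return (z, p, reject) for H0: mean == mu0 against H1: mean != mu0."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Standard error uses the sample standard deviation (ddof=1)
    z = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
    # Two-sided p value: twice the tail area beyond |z|
    p = 2 * stats.norm.sf(abs(z))
    return z, p, bool(p < alpha)

# Synthetic data for illustration only, not the weather dataset
rng = np.random.default_rng(0)
sample = rng.normal(loc=12.2, scale=9.5, size=35000)
z, p, reject = two_sided_z_test(sample, mu0=12.0)
print(f"z={z:.4f}, p={p:.6f}, reject={reject}")
```

Using the absolute value of z makes the result independent of the sign of the deviation, which matters when the sample mean falls below the hypothesised mean.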


In [51]:
# 2. For Column Apparent Temperature ©
#Find sample mean and sample standard deviation
number_of_values=len(df) #no of values in the column
Sample_data=df["Apparent Temperature ©"]

sample_mean=statistics.mean(Sample_data) 
sample_sd=statistics.stdev(Sample_data) 

# Hypotheses
# Null hypothesis
# H0: The mean Apparent Temperature © is 11.5 degrees C (mu = 11.5)

# Alternative hypothesis
# H1: The mean Apparent Temperature © is not equal to 11.5 degrees C (mu != 11.5)

population_mean_from_hypothesis=11.5

# Determining whether the test is one-tailed or two-tailed and allotting the alpha value.
# For a two-tailed test the 0.05 significance level is split into 0.025 per tail.

test="two_sided_test"
if(test=="two_sided_test"):
    number_of_tails=2
    alpha = 0.025
else:
    number_of_tails=1
    alpha =0.05

# Z score
z_score=(sample_mean-population_mean_from_hypothesis)/(sample_sd/np.sqrt(number_of_values))
# p value: for the two-tailed test, the tail area beyond |z| (compared with alpha per tail),
# so that negative z scores are handled correctly; for the one-tailed test, the upper tail area
if number_of_tails == 2:
    p_value = scipy.stats.norm.sf(abs(z_score))
else:
    p_value = scipy.stats.norm.sf(z_score)

if p_value > alpha:
    print('Failed to reject the null hypothesis')

else:
    print('Null hypothesis is rejected and alternate hypothesis is accepted')

print('z_score=%.10f' % (z_score))
print('p_value=%.10f' % (p_value))
print('sample_mean=%.10f' % (sample_mean))
print('sample_sd=%.10f' % (sample_sd))
print('The test is '+str(number_of_tails)+' tailed test')


Null hypothesis is rejected and alternate hypothesis is accepted
z_score=-6.4732208071
p_value=0.0000000000
sample_mean=11.1289711267
sample_sd=10.6818365960
The test is 2 tailed test
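
As a cross-check on both tests, the two-tailed decision rule can be written in terms of the critical value rather than the p value. This is a worked restatement of the formula used in the cells above, assuming a total significance level of 0.05:

$$
z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \text{reject } H_0 \text{ if } |z| > z_{1-\alpha/2} \approx 1.96
$$

With the reported z scores, $|3.5786| > 1.96$ for Temperature (C) and $|-6.4732| > 1.96$ for Apparent Temperature ©, so under this rule the null hypothesis is rejected in both cases.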
