### Assignment: Introduction to Modeling

Seattle has a reputation as one of the rainiest cities in the world. Even so, it is worth asking the question "Will it rain tomorrow?" Imagine you are headed to sleep at a hotel in downtown Seattle. The next day's activities include walking around outside for most of the day. You want to know whether it will rain (you don't care how much, a simple yes or no will do), because the answer determines what you wear and carry, such as an umbrella. Build a heuristic model to predict whether it will rain tomorrow.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import time
import datetime
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score, recall_score, precision_score

In [2]:
# Load Data
df = pd.read_csv('seattle_weather_1948-2017.csv')

In [3]:
# Find nulls in PRCP column
df[pd.isnull(df['PRCP'])]

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
18415,1998-06-02,,72,52,
18416,1998-06-03,,66,51,
21067,2005-09-05,,70,52,


In [4]:
# Find nulls in RAIN column
df[pd.isnull(df['RAIN'])]

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
18415,1998-06-02,,72,52,
18416,1998-06-03,,66,51,
21067,2005-09-05,,70,52,
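
The two null checks above can be condensed into a single per-column count. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the Seattle file); it also confirms whether `PRCP` and `RAIN` are missing on the same rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Seattle data (values are illustrative only)
toy = pd.DataFrame({
    "PRCP": [0.10, np.nan, 0.00, np.nan],
    "TMAX": [50, 72, 66, 70],
    "RAIN": [True, np.nan, False, np.nan],
})

# Number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# Rows where PRCP and RAIN are missing together
both_missing = toy["PRCP"].isna() & toy["RAIN"].isna()
print(int(both_missing.sum()))
```

If the count of rows missing both values equals each column's own null count, the gaps line up and can be handled in one pass.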


In [5]:
# We insert False because it is the most frequent value of RAIN

def RAIN_INSERTION(cols):
    """
    Insert False where NaN values are present
    """
    # Note: with axis=1 each row arrives as a Series; select its first (only) value by position
    RAIN=cols.iloc[0]
    if pd.isnull(RAIN):
        return False
    else:
        return RAIN

In [6]:
# We replace null values in the precipitation column with the column mean

def PRCP_INSERTION(col):
    """
    Insert the Mean of PRCP where NaN values are present
    """
    # Note: with axis=1 each row arrives as a Series; select its first (only) value by position
    PRCP=col.iloc[0]
    if pd.isnull(PRCP):
        return df['PRCP'].mean()
    else:
        return PRCP

In [7]:
# Apply the functions --> handling missing values 
df['RAIN']=df[['RAIN']].apply(RAIN_INSERTION,axis=1)
df['PRCP']=df[['PRCP']].apply(PRCP_INSERTION,axis=1)
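
The same imputation can likely be expressed without row-wise `apply`. This is a sketch on a made-up frame (`toy`, `filled`, and the values are assumptions for illustration) that fills `PRCP` with its mean and `RAIN` with False using `fillna`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative values)
toy = pd.DataFrame({
    "PRCP": [0.10, np.nan, 0.30],
    "RAIN": [True, np.nan, True],
})

# Mean of the observed PRCP values; NaN is skipped by default
prcp_mean = toy["PRCP"].mean()

filled = toy.copy()
filled["PRCP"] = filled["PRCP"].fillna(prcp_mean)

# Convert to the nullable boolean dtype first, fill with False, then use plain bool
filled["RAIN"] = filled["RAIN"].astype("boolean").fillna(False).astype(bool)

print(filled)
```

Column-wise `fillna` avoids calling a Python function once per row, which tends to matter on a table with tens of thousands of rows.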

In [8]:
# Check for NaN values
df[pd.isnull(df['RAIN'])]

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN


In [9]:
df[pd.isnull(df['PRCP'])]

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN


In [10]:
# First quartile (Q1)
Q1 = np.percentile(df['TMIN'], 25, interpolation = 'midpoint')
  
# Third quartile (Q3)
Q3 = np.percentile(df['TMIN'], 75, interpolation = 'midpoint')
  
# Interquartile range (IQR)
IQR = Q3 - Q1

# lower bound outliers --> Q1 - 1.5(IQR)
# higher bound outliers --> Q3 + 1.5 (IQR)
print(Q1- 1.5*(IQR))

17.0
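
The cell above prints only the lower bound. As a sketch, both bounds can be computed by one small helper; `iqr_bounds` is a hypothetical name, and it uses NumPy's default linear interpolation rather than `'midpoint'`, because the `interpolation` keyword was renamed to `method` in NumPy 1.22 and the default works across versions. Results on the real data may therefore differ slightly from the cell above.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds: Q1 - k*IQR and Q3 + k*IQR."""
    # Default (linear) interpolation; the notebook cell uses 'midpoint' instead
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative input: Q1 = 3, Q3 = 7, so IQR = 4
lower, upper = iqr_bounds([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(lower, upper)
```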


In [11]:
# Drop outliers from the TMIN column, i.e. values below the lower bound of 17
df = df.drop(df[df['TMIN']<17 ].index)

In [12]:
# Drop outliers from the TMAX column, i.e. values above 97.5 or below 21.5
df = df.drop(df[(df['TMAX']>97.5) | (df['TMAX']< 21.5)].index)

In [13]:
# Drop outliers from the PRCP column, i.e. values above 0.25 or below -0.15
df = df.drop(df[(df['PRCP']>0.25) | (df['PRCP']< -0.15) ].index)

In [14]:
# Reset index and drop index column
df = df.reset_index().drop("index", axis=1)
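
The drop-then-reset pattern used above can also be written as a single boolean mask. The sketch below applies the same TMIN and TMAX thresholds to a made-up frame (`toy`, `keep`, and `cleaned` are illustrative names) and resets the index with `drop=True`, which skips the intermediate `index` column.

```python
import pandas as pd

# Illustrative rows: the first fails the TMIN bound, the third fails the TMAX bound
toy = pd.DataFrame({
    "TMIN": [10, 30, 45, 50],
    "TMAX": [40, 60, 99, 70],
})

# Keep rows inside the bounds; between() is inclusive on both ends
keep = (toy["TMIN"] >= 17) & toy["TMAX"].between(21.5, 97.5)

# drop=True discards the old index instead of adding it as a column
cleaned = toy[keep].reset_index(drop=True)
print(cleaned)
```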

In [15]:
# Create function to perform our heuristic

# rain --> rain
# rain, 'unknown' --> rain 

def heuristic(df):
    
    """
Predict rain for each row x from rows x-2, x-1, and x.
1: If it rained the day before (row x-1)
     1.1) If TMAX on the day before, 2 days before, or today is 50 degrees or less
       1.1.1) If today's PRCP is zero --> predict no rain
       1.1.2) If today's PRCP is nonzero --> predict rain
     1.2) If TMAX on all three days is above 50 degrees --> predict no rain
2: Else if it rained today --> predict rain
3: Else --> predict no rain
The first two rows are predicted False by default.
    """
    
    preds = []
    for x in range(len(df)):
        if x <2:
            preds.append(False)
        else:
            # x --> now
            # x-1 --> yesterday 
            # x-2 --> two days ago "The day before yesterday"
            if (df.iloc[x-1]["RAIN"] == True):
                if(df.iloc[x-1]["TMAX"]<=50)| (df.iloc[x-2]["TMAX"] <=50)| (df.iloc[x]["TMAX"] <=50):
                        if(df.iloc[x]["PRCP"] ==0):
                            preds.append(False)
                        else: 
                            preds.append(True)
                else:
                    preds.append(False)
            
            elif (df.iloc[x]["RAIN"] == True):
                preds.append(True)
            else:
                preds.append(False)
    return preds
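
The row-by-row loop above calls `df.iloc` several times per row, which can be slow on a long table. The following is a sketch of the same rules written with shifted columns; `heuristic_vectorized` is a hypothetical name, the frame is made up, and it assumes `RAIN` holds no missing values (as is the case after the imputation step).

```python
import numpy as np
import pandas as pd

def heuristic_vectorized(df):
    """Same rules as the loop version, expressed with shifted columns."""
    rain_today = df["RAIN"].astype(bool)
    # RAIN on row x-1; the first row has no predecessor, so fill with False
    rain_prev = rain_today.shift(1, fill_value=False)

    tmax = df["TMAX"]
    # TMAX <= 50 on row x-1, x-2, or x (comparisons against NaN are False)
    cold = (tmax.shift(1) <= 50) | (tmax.shift(2) <= 50) | (tmax <= 50)
    wet = df["PRCP"] != 0

    # Rained yesterday: predict rain only if cold and PRCP is nonzero.
    # Otherwise: predict rain only if RAIN is True on the current row.
    preds = pd.Series(np.where(rain_prev, cold & wet, rain_today), index=df.index)
    preds.iloc[:2] = False  # first two rows default to False
    return preds

# Illustrative data, not taken from the Seattle file
toy = pd.DataFrame({
    "RAIN": [True, True, True, True, False, False, True, True],
    "TMAX": [45, 48, 60, 55, 49, 70, 65, 62],
    "PRCP": [0.1, 0.2, 0.03, 0.05, 0.0, 0.0, 0.1, 0.2],
})
toy_preds = heuristic_vectorized(toy)
print(toy_preds.tolist())
```

As in the loop version, the prediction for row x reads that row's own `PRCP` and `RAIN` values.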

In [16]:
# Apply Heuristic
df["preds"] = heuristic(df )

df.head()

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds
0,1948-01-05,0.17,45,32,True,False
1,1948-01-08,0.04,48,35,True,False
2,1948-01-09,0.12,50,31,True,True
3,1948-01-11,0.01,42,32,True,True
4,1948-01-12,0.0,41,26,False,False


In [17]:
# Determine Accuracy

# Create a function to count the confusion-matrix outcomes

def calc_confuse(df):
    
    "Calculate all possible results of a confusion matrix"

    # Hold all possible values and set to zero
    FP = np.zeros(len(df))
    TP = np.zeros(len(df))
    FN = np.zeros(len(df))
    TN = np.zeros(len(df))
    
    for x in range(len(df)):
        
        # True Positive
        if (df["RAIN"].iloc[x] == True) & (df["preds"].iloc[x] == True):
            TP[x] = 1
        # True Negative
        elif (df["RAIN"].iloc[x] == False) & (df["preds"].iloc[x] == False):
            TN[x] = 1
        # False Negative
        elif (df["RAIN"].iloc[x] == True) & (df["preds"].iloc[x] == False):
            FN[x] = 1
        # False Positive
        else:
            FP[x] = 1
    
    return FP, TP, FN, TN
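
If only the totals are needed rather than one indicator per row, the four outcomes can be counted with boolean masks. This is a minimal sketch; `confusion_counts` is a hypothetical helper and the inputs are made-up labels.

```python
import pandas as pd

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for two boolean sequences of equal length."""
    y_true = pd.Series(y_true).astype(bool)
    y_pred = pd.Series(y_pred).astype(bool)
    return {
        "TP": int((y_true & y_pred).sum()),    # rain predicted, rain observed
        "TN": int((~y_true & ~y_pred).sum()),  # no rain predicted, none observed
        "FP": int((~y_true & y_pred).sum()),   # rain predicted, none observed
        "FN": int((y_true & ~y_pred).sum()),   # no rain predicted, rain observed
    }

counts = confusion_counts(
    [True, True, False, False, True],   # observed
    [True, False, False, True, True],   # predicted
)
print(counts)
```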

In [18]:
# Extract results and create columns for each
w,x,y,z = calc_confuse(df)

df["FP"] = w
df["TP"] = x
df["FN"] = y
df["TN"] = z

# Look at 10 random rows to spot-check the predictions
df.sample(10)

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN,preds,FP,TP,FN,TN
16600,2001-02-07,0.0,41,26,False,False,0.0,0.0,0.0,1.0
17559,2004-02-03,0.06,49,39,True,True,0.0,1.0,0.0,0.0
1550,1953-02-01,0.2,47,42,True,True,0.0,1.0,0.0,0.0
9931,1979-12-06,0.0,50,47,False,False,0.0,0.0,0.0,1.0
1896,1954-03-24,0.0,56,37,False,False,0.0,0.0,0.0,1.0
6763,1969-11-15,0.0,54,40,False,False,0.0,0.0,0.0,1.0
18869,2008-03-22,0.0,59,34,False,False,0.0,0.0,0.0,1.0
12866,1989-03-27,0.24,51,41,True,False,0.0,0.0,1.0,0.0
20759,2014-04-15,0.02,58,46,True,True,0.0,1.0,0.0,0.0
16625,2001-03-06,0.0,62,40,False,False,0.0,0.0,0.0,1.0


In [19]:
# Calculate Accuracy
(sum(df["TP"]) + sum(df["TN"])) / len(df)

0.9158635180194583
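
Accuracy is one of several summaries that follow from the four counts. As a sketch, precision and recall can be derived from the same totals; `summarize` is a hypothetical helper and the counts passed in are illustrative, not the notebook's results.

```python
def summarize(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or observed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Illustrative counts only
acc, prec, rec = summarize(tp=2, tn=1, fp=1, fn=1)
print(acc, prec, rec)
```

Recall is the share of rainy days the heuristic catches, which may matter more than overall accuracy when deciding whether to carry an umbrella.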

In [20]:
# Baseline Model Prediction
# What would be our accuracy if we predicted the majority class
df["RAIN"].value_counts(normalize=True)

False    0.665464
True     0.334536
Name: RAIN, dtype: float64
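
The baseline is the accuracy of always predicting the most frequent class, which equals the largest share in the output above. A minimal sketch on made-up labels (`rain` is an illustrative Series, not the Seattle column):

```python
import pandas as pd

# Illustrative labels: 4 dry days and 2 rainy days
rain = pd.Series([False, False, True, False, True, False])

shares = rain.value_counts(normalize=True)
majority_class = shares.idxmax()   # the class a constant predictor would output
baseline_accuracy = shares.max()   # accuracy of always predicting that class

print(majority_class, baseline_accuracy)
```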

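#### Highest accuracy is 
0.9158635180194583
The heuristic combines whether it rained on the previous day and on the current day with the maximum temperature over three days, and it raises the accuracy from the majority-class baseline of 0.67 to 0.92.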