# Name: **SIMRAN ANAND**
# Registration Number: **19BCD7243**
# Weather Forecast in Seattle using Decision Trees 

## Lab 3 assignment of CSE4029

### Importing important libraries:

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sb
import os

from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing

from sklearn import metrics
from sklearn.model_selection import train_test_split 
from sklearn.metrics import classification_report

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

### Reading the data:

In [2]:
met_df = pd.read_csv('../input/seattleweather-19482017/seattleWeather_1948-2017.csv')
print(met_df.head(), end='\n\n\n')  # leave two blank lines before the info() summary
met_df.info()

The description and unit of each variable:
- DATE = the date of the observation
- PRCP = the amount of precipitation, in inches
- TMAX = the maximum temperature for that day, in degrees Fahrenheit
- TMIN = the minimum temperature for that day, in degrees Fahrenheit
- RAIN = TRUE if rain was observed on that day, FALSE if it was not

## Data Cleaning:

### Step 1: Correcting wrong values or outliers:

In [3]:
met_df.describe(include = 'all')

The summary statistics look sensible: the mean, minimum, and maximum of each variable fall within a plausible range, so there is no sign of erroneous entries (such as a temperature of 200 F).
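
The visual check above can also be written as an explicit range test. The sketch below is only an illustration: the helper `flag_implausible`, the `LIMITS` thresholds, and the toy rows are all hypothetical and are not taken from the dataset or its documentation.

```python
import pandas as pd

# Hypothetical plausibility limits (assumed for illustration, not from the dataset docs)
LIMITS = {"PRCP": (0.0, 10.0), "TMAX": (-20.0, 120.0), "TMIN": (-30.0, 100.0)}


def flag_implausible(df: pd.DataFrame, limits: dict = LIMITS) -> pd.DataFrame:
    """Return the rows that fall outside the limits or have TMIN above TMAX."""
    bad = pd.Series(False, index=df.index)
    for col, (low, high) in limits.items():
        # Comparisons with NaN are False, so missing values are not flagged here
        bad |= (df[col] < low) | (df[col] > high)
    bad |= df["TMIN"] > df["TMAX"]
    return df[bad]


# Made-up rows: the second has an impossible TMAX, the third has TMIN > TMAX
toy = pd.DataFrame({
    "PRCP": [0.0, 0.47, 0.1],
    "TMAX": [51, 200, 45],
    "TMIN": [42, 36, 50],
})
print(flag_implausible(toy))
```

An empty result on the real `met_df` would support the conclusion that no correction is needed at this step.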

### Step 2: Imputing missing values:

PRCP and RAIN each have only three missing data points, so we fill the gaps with the median for PRCP and the mode for RAIN.

In [4]:
met_df.isna().sum()

In [5]:
P_median = met_df.PRCP.median()
R_mode   = met_df.RAIN.mode()[0]

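# Assign the filled columns back: fillna(inplace=True) on an attribute-accessed
# column is chained assignment and may not update met_df in recent pandas versions.
met_df['PRCP'] = met_df['PRCP'].fillna(P_median)
met_df['RAIN'] = met_df['RAIN'].fillna(R_mode)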

met_df.isna().sum()
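
To see the imputation rule in isolation, here is a minimal sketch on a made-up five-row frame. The helper name `impute_weather` and the toy values are assumptions for illustration; the helper returns a copy, so the input frame is left untouched.

```python
import numpy as np
import pandas as pd


def impute_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with PRCP filled by its median and RAIN by its mode."""
    out = df.copy()
    out["PRCP"] = out["PRCP"].fillna(out["PRCP"].median())
    out["RAIN"] = out["RAIN"].fillna(out["RAIN"].mode()[0])
    return out


# Made-up data with one missing value in each column
toy = pd.DataFrame({
    "PRCP": [0.0, 0.2, np.nan, 0.0, 1.0],
    "RAIN": [False, True, np.nan, False, False],
})
filled = impute_weather(toy)
print(filled)
print(filled.isna().sum())
```

The median is used for PRCP because it is not pulled upward by the occasional very wet day, which a mean would be.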

### Step 3: Converting the boolean variable to a dummy variable:
- We change RAIN from True/False to 1/0.
- We then replace the original variable with the encoded one.

In [6]:
from sklearn.preprocessing import LabelEncoder
RAIN_encode = LabelEncoder().fit_transform(met_df.RAIN)
RAIN_encode

In [7]:
met_df['RAIN'] = RAIN_encode

met_df.describe(include = 'all')
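
For a two-valued True/False column, `LabelEncoder` should give the same result as a plain integer cast. The sketch below checks this on a small made-up series; it is an illustration of the equivalence, not a claim about how the notebook must be written.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up boolean column standing in for RAIN after imputation
rain = pd.Series([True, False, False, True, True], name="RAIN")

encoded = LabelEncoder().fit_transform(rain)  # classes are sorted: False -> 0, True -> 1
as_int = rain.astype(int).to_numpy()          # direct cast of the boolean values

print(encoded)
print(np.array_equal(encoded, as_int))
```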

### Making sure all predictors are independent

### Insightful plots

In [8]:
%matplotlib inline
rcParams['figure.figsize'] = 6, 5
sb.set_style('whitegrid')

sb.pairplot(met_df, palette = 'husl', hue = 'RAIN')
plt.show()

In [9]:
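# DATE is still a text column here, so restrict the correlation to numeric columns
sb.heatmap(met_df.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap = 'RdBu_r')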
plt.show()

In [10]:
sb.scatterplot(x = 'TMIN', y ='TMAX', data = met_df, hue = 'RAIN')
plt.show()

- As expected, TMIN and TMAX are highly correlated, so we drop TMIN, which has the lower correlation with RAIN.
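
The rule used here (of two collinear predictors, drop the one less correlated with the target) can be sketched in a few lines. The data below is synthetic and generated so that RAIN depends on TMAX; the numbers are assumptions and do not reproduce the Seattle dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: TMIN tracks TMAX, and RAIN is driven by TMAX plus noise
rng = np.random.default_rng(0)
n = 5000
tmax = rng.normal(60, 12, n)
tmin = tmax - 15 + rng.normal(0, 6, n)
rain = (tmax + rng.normal(0, 10, n) < 58).astype(int)
toy = pd.DataFrame({"TMAX": tmax, "TMIN": tmin, "RAIN": rain})

corr = toy.corr()
pair_corr = corr.loc["TMAX", "TMIN"]             # how collinear the two predictors are
target_corr = corr["RAIN"].drop("RAIN").abs()    # strength of each predictor's link to RAIN
to_drop = target_corr.idxmin()                   # the weaker predictor of the pair

print(f"corr(TMAX, TMIN) = {pair_corr:.2f}")
print(target_corr.round(2))
print("drop:", to_drop)
```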

In [11]:
fig, axis = plt.subplots(1, 2,figsize=(10,4))
sb.boxplot(x = 'RAIN', y ='TMAX', data = met_df, ax = axis[0], showfliers = False)
sb.boxplot(x = 'RAIN', y ='TMIN', data = met_df, ax = axis[1], showfliers = False)
plt.show()

- Here, TMAX and TMIN are each grouped by RAIN.
- Again, we see that TMAX is the better predictor of RAIN.

- We should also drop the 'PRCP' variable: if we know the amount of precipitation on a given day, we already know whether it rained that day.
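
The reason PRCP leaks the target can be shown with a one-line rule. The sketch assumes that RAIN is recorded as 1 exactly when PRCP is above zero, which is an assumption about how the labels were produced; the rows are made up.

```python
import pandas as pd

# Made-up rows with RAIN already encoded as 1/0
toy = pd.DataFrame({
    "PRCP": [0.0, 0.47, 0.0, 0.59, 0.01],
    "RAIN": [0, 1, 0, 1, 1],
})

# A "model" that needs no training: predict rain whenever any precipitation was measured
rule_pred = (toy["PRCP"] > 0).astype(int)
agreement = (rule_pred == toy["RAIN"]).mean()
print(f"agreement with RAIN: {agreement:.0%}")
```

Any classifier given PRCP could learn this threshold, so its accuracy would say nothing about how well temperature predicts rain.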

In [12]:
met_df.drop(['TMIN', 'PRCP','DATE'], inplace = True, axis=1)
met_df.head()

## Implementing machine learning algorithms (MLAs):

### Splitting the data into test and train sets:


In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(met_df.drop('RAIN', axis=1),
                                                   met_df['RAIN'], test_size=0.2, random_state=10)                             
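
Because the two classes are not equally frequent, one optional variation is to pass `stratify` so that the train and test sets keep the same share of rainy days. The sketch below uses synthetic data with an assumed 43% rain share; it is an alternative to, not a description of, the split used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for met_df after the cleaning steps
rng = np.random.default_rng(10)
toy = pd.DataFrame({"TMAX": rng.normal(60, 12, 1000)})
toy["RAIN"] = (rng.random(1000) < 0.43).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["TMAX"]], toy["RAIN"],
    test_size=0.2, random_state=10,
    stratify=toy["RAIN"],  # keep the rain / no-rain ratio equal in both parts
)
print(f"rain share in train: {y_tr.mean():.3f}, in test: {y_te.mean():.3f}")
```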


### Using a variety of MLAs to get the best results

- The outcome is binary, so we can use Logistic Regression, Decision Tree, or Naive Bayes. 
- We also use ensemble algorithms (such as Random Forest) to see if the score can be improved.

In [14]:
all_classifiers = {
    'Ada Boost': AdaBoostClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=50, min_samples_leaf=1,
                                            min_samples_split=2, max_depth=4),
    'Gaussian NB': GaussianNB(),
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Decision Tree': DecisionTreeClassifier(),
    'SVC': SVC(),
}

In [15]:
ML_name = []
ML_accuracy = []
# Fit each classifier on the training set and record its accuracy on the test set
for Name, classifier in all_classifiers.items():
    classifier.fit(X_train, Y_train)
    Y_pred = classifier.predict(X_test)
    ML_accuracy.append(metrics.accuracy_score(Y_test, Y_pred))
    ML_name.append(Name)
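
A single train/test split gives one accuracy number per model, and small differences between models may be noise. As a hedged alternative, the sketch below compares a few of the same classifier types with 5-fold cross-validation on synthetic data; the data, the model settings, and the names `models` and `cv_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic one-feature problem standing in for TMAX -> RAIN
rng = np.random.default_rng(0)
X = rng.normal(60, 12, size=(600, 1))
y = (X[:, 0] + rng.normal(0, 12, 600) < 58).astype(int)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Gaussian NB": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Five accuracy values per model, one for each held-out fold
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough sense of whether one model is really ahead of another.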

In [16]:
rcParams['figure.figsize'] = 8, 4
plt.barh(ML_name, ML_accuracy, color = 'brown')
plt.xlabel('Accuracy Score', fontsize = '14')
plt.ylabel('Machine Learning Algorithms', fontsize = '14')
plt.xlim([0.65, 0.685])
plt.show()

### Tuning model hyperparameters:


### Decision Tree:


In [17]:
criteri       = ['gini', 'entropy']
min_samp_lf   = [1, 2, 5, 10]
min_samp_splt = [2, 4, 8, 12]
maxim_depth   = [2, 4, 8, 12, None]

max_score = 0

for c in criteri:
    for ml in min_samp_lf:
        for ms in min_samp_splt:
            for md in maxim_depth:
                # Fit one tree per parameter combination and score it on the test set
                MLA = DecisionTreeClassifier(criterion=c, min_samples_leaf=ml, min_samples_split=ms, max_depth=md)
                MLA.fit(X_train,Y_train)
                Y_pred = MLA.predict(X_test)
                score = metrics.accuracy_score(Y_test,Y_pred)
                if score > max_score:
                    max_score, c_best, l_best, s_best, d_best = score, c, ml, ms, md

print('maximum accuracy score, criterion, min_samples_leaf, min_samples_split, max_depth:')
print(max_score, c_best, l_best, s_best, d_best)
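
The nested loops pick the parameters that score best on the test set, so the reported maximum is likely to be somewhat optimistic. One common alternative is scikit-learn's `GridSearchCV`, which selects parameters by cross-validation on the training data only and leaves the test set for a final check. The sketch below applies it to the same grid on synthetic data; the data, `cv=3`, and `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic one-feature problem standing in for TMAX -> RAIN
rng = np.random.default_rng(0)
X = rng.normal(60, 12, size=(400, 1))
y = (X[:, 0] + rng.normal(0, 12, 400) < 58).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# Same grid as the nested loops above
param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 2, 5, 10],
    "min_samples_split": [2, 4, 8, 12],
    "max_depth": [2, 4, 8, 12, None],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X_train, y_train)          # the test set is not used for selection

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy:   {search.score(X_test, y_test):.3f}")
```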

### Ada Boost Classifier:

In [18]:
learning_R    = [1, 2, 3]
random_st     = [None, 20]
n_estimat     = [50, 100]

max_score = 0

for lr in learning_R:
    for rs in random_st:
        for ne in n_estimat:
            # Fit one AdaBoost model per parameter combination and score it on the test set
            MLA = AdaBoostClassifier(random_state=rs, learning_rate=lr, n_estimators=ne)
            MLA.fit(X_train,Y_train)
            Y_pred = MLA.predict(X_test)
            score = metrics.accuracy_score(Y_test,Y_pred)
            if score > max_score:
                max_score, r_best, l_best, n_best = score, rs, lr, ne

print('maximum accuracy score, random_state, learning_rate, n_estimators:')
print(max_score, r_best, l_best, n_best)

## Observations:

- Tuning the hyperparameters of the MLAs gives an accuracy score of about 68.1%, meaning that our models predict rain correctly about 68% of the time on the test set.

- This is better than the baselines. Tossing a coin would predict rain with 50% accuracy. Moreover, 57% of the instances are no-rain days and 43% are rain days, so always predicting no rain would score 57%. Our models improve on that stronger baseline by 11.1 percentage points.
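
The baseline arithmetic can be reproduced with scikit-learn's `DummyClassifier`. The sketch below builds 100 made-up labels that mirror the 57% / 43% class shares quoted above and takes the 68.1% model accuracy from the text; nothing here is recomputed from the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# 57 dry days and 43 rainy days, mirroring the class shares quoted above
y = np.array([0] * 57 + [1] * 43)
X = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predict the most frequent class (no rain)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))

model_acc = 0.681  # tuned accuracy reported in the observations
gain_points = (model_acc - baseline_acc) * 100

print(f"majority-class baseline: {baseline_acc:.1%}")
print(f"gain over baseline: {gain_points:.1f} percentage points")
```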

