## Welcome to the Code Pub!

#### This is a Jupyter notebook: a web-based interactive environment for creating notebook documents in which you write code in separate cells and run them one at a time. It is widely used by data scientists for experimenting with data.

Try hitting shift+enter in the cell below: 

In [51]:
print("Hello World!")

Hello World!


#### Once a cell has run, it does not need to be run again: all variables stay available in the notebook's scope until the kernel is restarted. Let's import the necessary packages.

In [52]:
import pandas as pd
import sklearn
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 200)

## Some information about the data 

The dataset we are going to work with is "Rain in Australia" from kaggle.com: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package . The use case is to predict whether it will rain tomorrow.

## The first step is to load the data

In [53]:
## We use pandas to load the data 
df = pd.read_csv('WeatherAUS.csv')

## Next step is to explore the data

In [75]:
# Let's start by printing the data and looking at the columns

#df

In [64]:
# Now let's have a look at the data types of the columns 

#df.dtypes

In [63]:
# Let's have a look at the columns that are "objects"

#df[['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']]

In [65]:
# The non-numeric columns will have to be transformed to numeric values later,
# so we would like to know how many unique values each of them has.

#object_column_names = ['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
#for column in object_column_names:
#    print(column + ': ' + str(len(df[column].unique())) + " unique values.")

In [66]:
# We also see some columns with NaN values. Let's see how many rows per column are not NaN

#df.count().sort_values()
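
Counting non-NaN rows works well, but a fraction of missing values per column is often easier to compare against a threshold. Here is a minimal sketch on a made-up toy table (the values are invented for illustration and are not taken from the weather data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the weather table
toy = pd.DataFrame({
    "MinTemp": [12.0, 14.5, np.nan, 9.0],
    "Sunshine": [np.nan, np.nan, 7.2, np.nan],
    "RainToday": ["No", "Yes", "No", None],
})

# isna() marks the missing cells; the mean of those booleans
# is the fraction of missing values in each column
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

The same expression should work on `df` once the data is loaded.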

In [74]:
# Now let's use pandas describe() function to explore the data some more

#df.describe()

### Preprocess the data 

In [67]:
# During the exploration we saw 4 columns that have more than 40% NaN values. 
# We will start by dropping these columns because we are not sure how to replace the empty values. 

#df = df.drop(columns=['Sunshine', 'Evaporation', 'Cloud3pm', 'Cloud9am'])
#df.shape

In [68]:
# There are different ways of dealing with empty values. 
# But we don't have enough time during this workshop to explore those options.
# So we will drop all rows that have any NaN values. 

#df = df.dropna(how='any')
#df.shape
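
If you want to try one of those other options after the workshop, a common one is imputation: filling the gaps instead of dropping rows. The sketch below is only an illustration on invented toy data; it fills a numeric column with its median and a categorical column with its most frequent value. Which strategy suits the weather data is an open question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps (column names borrowed from the dataset)
toy = pd.DataFrame({
    "Humidity3pm": [40.0, np.nan, 60.0, 80.0],
    "WindGustDir": ["N", "N", None, "SW"],
})

filled = toy.copy()
# Numeric column: fill with the median of the observed values
filled["Humidity3pm"] = filled["Humidity3pm"].fillna(filled["Humidity3pm"].median())
# Categorical column: fill with the most frequent observed value
filled["WindGustDir"] = filled["WindGustDir"].fillna(filled["WindGustDir"].mode()[0])
print(filled)
```

In a real pipeline the median and the most frequent value would be computed on the training data only, so that nothing from the test data leaks into the model.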

In [69]:
# The goal of this use case is to see how well we can predict if it will rain tomorrow in a given city.
# This is a binary classification problem. 
# So we will begin with mapping the two binary columns "RainToday" and "RainTomorrow" to 0/1

#df['RainTomorrow'] = df['RainTomorrow'].map({'Yes':1, 'No':0})
#df['RainToday'] = df['RainToday'].map({'Yes':1, 'No':0})
#df

In [70]:
# We will drop the date column because we will assume that seasonal information can already be reflected in the meteorological data

#df = df.drop(columns=['Date'])

In [71]:
# We will also drop the RISK_MM column because it leaks information according to the description

#df = df.drop(columns=['RISK_MM'])

In [72]:
# For the categorical columns we will perform one hot encoding. 
# We will do this by using pandas get_dummies() function.
# Since the columns didn't have a large range of unique values, one hot encoding is a good option.

#df = pd.get_dummies(df)
#df
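
To see what `get_dummies()` does, here is a small sketch on an invented two-column table: the categorical column is replaced by one indicator column per unique value, and the numeric column is left as it is.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numeric column
toy = pd.DataFrame({
    "WindGustDir": ["N", "SW", "N"],
    "Rainfall": [0.0, 2.4, 0.6],
})

# Each unique value of WindGustDir becomes its own indicator column
encoded = pd.get_dummies(toy)
print(encoded)
```

This is also why the number of unique values matters: a column with many distinct values would produce equally many new columns.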

In [73]:
# We will look at the correlation between "RainTomorrow" and the other columns

#df.corr().loc[['RainTomorrow']]

### Split data and normalize 

In [95]:
# First we will separate the training columns and the target column

#X = df.drop(columns=['RainTomorrow'])
#y = df['RainTomorrow']

In [96]:
# Next we will split our data into test and training datasets using train_test_split from sklearn

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
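
Because the classes are not balanced, it can help to keep the same share of rainy days in both parts of the split. The sketch below uses invented data to show two optional arguments of `train_test_split`: `stratify` preserves the class ratio and `random_state` makes the split reproducible. The specific values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 20% of them in the positive class
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 40 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,      # same test share as in the workshop cell
    stratify=y_toy,     # keep the class ratio in both parts
    random_state=42,    # arbitrary seed, only for reproducibility
)
print(y_tr.mean(), y_te.mean())
```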

In [97]:
# We will use the MinMax scaler to normalize the data 

#scaler = MinMaxScaler()
#columns = X.columns
#X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train))
#X_train_scaled.columns = columns

In [98]:
# Now that the scaler is fit, we will transform the test data with the scaler

#X_test_scaled = pd.DataFrame(scaler.transform(X_test))
#X_test_scaled.columns = columns
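
The order of the two scaling cells matters: the scaler learns the minimum and maximum from the training data only and then reuses them for the test data. A minimal sketch with invented temperatures shows the effect; as a side benefit, passing `columns` and `index` directly keeps the original labels.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train and test tables (values invented for illustration)
train = pd.DataFrame({"Temp3pm": [10.0, 20.0, 30.0]}, index=[7, 3, 9])
test = pd.DataFrame({"Temp3pm": [15.0, 35.0]}, index=[1, 4])

scaler = MinMaxScaler()
# fit_transform learns min=10 and max=30 from the training data
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)
# transform reuses those values, so test data can fall outside [0, 1]
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)
print(train_scaled)
print(test_scaled)
```

The test value 35 is larger than anything seen during fitting, so it is scaled to a value above 1. That is expected and not an error.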

### Now we will train a model

In [88]:
# We will set the class_weight to balanced because the classes are not balanced.
# model info: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

#model = DecisionTreeClassifier(max_depth=3, class_weight="balanced")
#model.fit(X_train_scaled, y_train)

In [89]:
# The score is the accuracy: the fraction of data points that were correctly classified

#model.score(X_test_scaled, y_test)
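
A single score can look better than it is when the classes are not balanced. As a rough sanity check, compare it with a "model" that always predicts the majority class. The labels below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 8 dry days (0) and 2 rainy days (1)
y_true = np.array([0] * 8 + [1] * 2)

# A baseline that always predicts "no rain"
y_always_no = np.zeros_like(y_true)

baseline = accuracy_score(y_true, y_always_no)
print(baseline)
```

This baseline never finds a rainy day but is still right most of the time, so a useful model should clearly beat it.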

In [91]:
# It is also important to look at the confusion matrix to see the number of correctly classified data points
# in each class.

# Let's have a look at the confusion matrix 

#predicted = model.predict(X_test_scaled)
#true_neg, false_pos, false_neg, true_pos = confusion_matrix(y_test, predicted).ravel()
#true_neg, false_pos, false_neg, true_pos
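
The four numbers from the confusion matrix can be turned into per-class metrics. The sketch below uses invented labels and computes precision (how many predicted rainy days were really rainy) and recall (how many real rainy days were found):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = rain, 0 = no rain)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted rain that was correct
recall = tp / (tp + fn)     # share of actual rain that was detected
print(tn, fp, fn, tp, precision, recall)
```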

In [92]:
# Let's plot the tree
# Which features are at the top layers? 

#from sklearn import tree
#plot = tree.plot_tree(model, fontsize=10, filled=True)
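
If the plot is hard to read, a text view of the tree can answer the same question. The sketch below trains a small tree on synthetic data (generated with `make_classification`, not the weather data) and prints it with `export_text`. The feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with placeholder feature names
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]

tree_model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Text rendering of the splits, top layer first
text = export_text(tree_model, feature_names=names)
print(text)

# The feature used at the root is the one the tree splits on first
root_feature = names[tree_model.tree_.feature[0]]
print("Root split on:", root_feature)
```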

### Now you can experiment with other models :)  

In [93]:
# Here are three example models to experiment with
# You can try and experiment with different hyperparameters
models = {
    "LogisticRegression": LogisticRegression(class_weight="balanced"), 
    "RandomForestClassifier": RandomForestClassifier(), 
    "GradientBoostingClassifier": GradientBoostingClassifier()
    # ...
    # Add more models ?! See scikit-learn documentation
}

In [94]:
# Score each model
# Would it also be interesting to print the confusion matrix?

#for key, model in models.items():
#    model.fit(X_train_scaled, y_train)
#    print(key + ": " + str(model.score(X_test_scaled, y_test)))
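
To also print the confusion matrix for each model, the loop can be extended as sketched below. This version runs on synthetic, imbalanced data from `make_classification` so that it is self-contained. The model choices, `max_iter` and the seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: roughly 80% class 0 and 20% class 1
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

candidates = {
    "LogisticRegression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(
        max_depth=3, class_weight="balanced", random_state=0
    ),
}

results = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    matrix = confusion_matrix(y_te, candidate.predict(X_te))
    results[name] = (candidate.score(X_te, y_te), matrix)
    print(name + ": " + str(results[name][0]))
    print(matrix)
```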