# Introduction
This tutorial was created to give someone with little to no experience in data
science an introduction to data processing and machine learning.
In this tutorial we will learn about machine learning by making some predictions. We'll cover a number of steps that are required to create a model:
    - import the data from a CSV file into a dataframe
    - preprocess the data
        - remove null values
        - create x and y sets for the features and target column
        - apply scaling
        - split the data into train test sets
    - create a model
        - fit the data to the model
        - get a prediction from the model
    - confusion matrix to view predictions
        
To do this we will use a dataset that contains information about the weather in Seattle,
more specifically whether or not it rained, covering every day from 1948 to 2017.
The columns in the dataset are the date, precipitation, TMAX (maximum temperature), TMIN (minimum temperature) and
whether or not it rained.

  
The dataset can be downloaded from Kaggle at the following link:<br>-  https://www.kaggle.com/rtatman/did-it-rain-in-seattle-19482017

Other useful information/links<br>

Pandas:<br>-  https://pandas.pydata.org/<br>

Scikit learn:<br>-  https://scikit-learn.org/stable/<br>

Train test split:<br>-  https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html<br>

Logistic Regression:<br>-  https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html<br>

Scikit-learn minmax_scale:<br>-  https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html

Confusion Matrix:<br>-  https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html


In [1]:
# The first thing we will do is import some libraries we will need that are available in python.
import os
import pandas as pd

# Import the data
The data is located in a csv file which is in the same folder as this notebook. We will first get the csv file and then read it into a data frame.

In [2]:
weather_data = os.path.join("seattleWeather_1948-2017.csv")

In [3]:
# place the data from csv into a data frame.
df = pd.read_csv(weather_data)

In [4]:
# Check that the data was imported
# This command shows the first 5 entries of a data frame
df.head()

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
0,1948-01-01,0.47,51,42,True
1,1948-01-02,0.59,45,36,True
2,1948-01-03,0.42,45,35,True
3,1948-01-04,0.31,45,34,True
4,1948-01-05,0.17,45,32,True


In [5]:
# This command will show the last 5 entries of a data frame
df.tail()

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
25546,2017-12-10,0.0,49,34,False
25547,2017-12-11,0.0,49,29,False
25548,2017-12-12,0.0,46,32,False
25549,2017-12-13,0.0,48,34,False
25550,2017-12-14,0.0,50,36,False


In [6]:
# Check the shape of the data (number of rows and columns, 25551 rows with 5 Columns) 
df.shape

(25551, 5)
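
To try the import step without downloading the file, here is a minimal sketch that reads the same five columns from an in-memory string instead of a file on disk. The rows are copied from the `df.head()` output above; the names `csv_text` and `sample` are only for illustration and do not appear in the notebook.

```python
import io

import pandas as pd

# First five rows of the dataset, copied from the df.head() output.
csv_text = """DATE,PRCP,TMAX,TMIN,RAIN
1948-01-01,0.47,51,42,True
1948-01-02,0.59,45,36,True
1948-01-03,0.42,45,35,True
1948-01-04,0.31,45,34,True
1948-01-05,0.17,45,32,True
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # column types inferred by pandas
```

The same `head()`, `tail()` and `shape` checks used above work on `sample` in exactly the same way.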

# Preprocessing
In the preprocessing stage we will check for null values and deal with them if we find any. We will create a feature set (columns used for the predictions) and a target set (column we are trying to predict).

In [7]:
# Count the null values in each column and show the largest count
df.isnull().sum().max()

3

There are three rows that contain null values. Although this is a low number for a dataset with 25551 rows, we should deal with them so that they do not affect the results. There are a number of ways to handle null values: we can remove the affected rows, we can set them to the average value of the column they are in, or we can set them to the median, which is the value right in the middle of all the values in the column. Three rows out of roughly 25000 probably won't cause a problem, but for practice and good standards we will deal with the null values.

In [8]:
# The command below drops any row that contains a null value in any column.
df = df.dropna()

In [9]:
# The data frame now contains 3 fewer rows
df.shape

(25548, 5)
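
The three ways of handling null values mentioned in this section can be compared side by side on a toy data frame. This is only a sketch with made-up values; the notebook itself uses the first option (dropping rows), and the names `toy`, `dropped`, `mean_filled` and `median_filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing precipitation value (illustrative data only).
toy = pd.DataFrame({
    "PRCP": [0.0, np.nan, 0.3, 0.9],
    "TMAX": [50, 48, 47, 51],
})

# Null count per column, and the number of rows with at least one null.
nulls_per_column = toy.isnull().sum()
rows_with_null = toy.isnull().any(axis=1).sum()

# Option 1: drop the affected rows (what this notebook does).
dropped = toy.dropna()

# Option 2: replace the null with the column mean.
mean_filled = toy.fillna({"PRCP": toy["PRCP"].mean()})

# Option 3: replace the null with the column median.
median_filled = toy.fillna({"PRCP": toy["PRCP"].median()})

print(nulls_per_column)
print(rows_with_null, len(dropped))
print(mean_filled.loc[1, "PRCP"], median_filled.loc[1, "PRCP"])
```

Dropping is the simplest choice when only a handful of rows are affected; filling keeps every row, at the cost of inventing a value for the missing cell.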

## Prediction columns and target column
In the next step we will split the dataset into an x set and a y set. x will be the columns we want to use
for the predictions and y will be the column we want to predict, i.e. x will be PRCP, TMAX and TMIN, and y will be the RAIN column.

In [10]:
features = ['PRCP', 'TMAX', 'TMIN']
x = df[features]
y = df['RAIN']
# astype('bool') converts the RAIN column to a boolean dtype
y = y.astype('bool')
x.head()

Unnamed: 0,PRCP,TMAX,TMIN
0,0.47,51,42
1,0.59,45,36
2,0.42,45,35
3,0.31,45,34
4,0.17,45,32


In [11]:
y.head()

0    True
1    True
2    True
3    True
4    True
Name: RAIN, dtype: bool
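
One likely reason the `astype('bool')` cast is useful: when a True/False column contains a missing value, pandas cannot load it as a plain boolean column, and it does not switch back automatically after the null rows are dropped. The sketch below shows this on a three-row toy CSV. The data and the names `toy`, `x_toy` and `y_toy` are assumptions made for illustration.

```python
import io

import pandas as pd

# Illustrative CSV with one incomplete row (missing PRCP and RAIN).
csv_text = """PRCP,TMAX,TMIN,RAIN
0.47,51,42,True
,45,36,
0.0,49,34,False
"""

toy = pd.read_csv(io.StringIO(csv_text))
print(toy["RAIN"].dtype)  # not a plain bool column while a value is missing

toy = toy.dropna()

features = ["PRCP", "TMAX", "TMIN"]
x_toy = toy[features]               # DataFrame: one column per feature
y_toy = toy["RAIN"].astype("bool")  # Series: the target, as real booleans

print(x_toy.shape)
print(y_toy.dtype)
```

Selecting with a list of column names returns a DataFrame, while selecting a single name returns a Series, which is why `x` has a table-like `head()` and `y` does not.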

## Scaling
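We will apply scaling to the dataset. The purpose of scaling is to rescale each feature to a given range, here 0 to 1, so that features measured on different scales (such as precipitation and temperature) become comparable.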

In [13]:
# by default minmax_scale uses feature_range=(0, 1), so we do not need to provide the range
from sklearn.preprocessing import minmax_scale
x = minmax_scale(x)
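
Min-max scaling maps each column independently using `(value - column_min) / (column_max - column_min)`. The sketch below checks that formula against `minmax_scale` on a small made-up array; the numbers and variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Illustrative values: column 0 is precipitation-like, column 1 temperature-like.
raw = np.array([[0.0, 30.0],
                [0.5, 45.0],
                [1.5, 60.0]])

# Library version: each column is rescaled to the default range 0 to 1.
scaled = minmax_scale(raw)

# Manual version of the same formula, applied column by column.
col_min = raw.min(axis=0)
col_max = raw.max(axis=0)
manual = (raw - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

Note that `minmax_scale` returns a NumPy array, so after scaling `x` no longer has column names or a `head()` method.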

## Train, Test, Split
We will split the data up into a training set and a test set. The data will be split 70/30: 70% for training and 30% for testing. We will do this using the train_test_split function from sklearn.

In [14]:
# In order to train and test the data we will use the train_test_split function from sklearn
from sklearn.model_selection import train_test_split

In [15]:
# Split the data into train and test sets. The data is split 70-30; this can be changed by changing
# test_size in the parameters. random_state=0 makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

In [16]:
# Check the sets
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(17883, 3)
(7665, 3)
(17883,)
(7665,)
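
The set sizes above follow from the row count: 30% of 25548 is 7664.4, which is rounded up to 7665 test rows, leaving 17883 rows for training. The sketch below reproduces those sizes on placeholder arrays of the same length. The arrays hold zeros and stand in for the real data, so only the shapes are meaningful.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 25548  # rows left after dropping nulls

# Placeholder arrays with the same shape as the real features and target.
x_demo = np.zeros((n_rows, 3))
y_demo = np.zeros(n_rows, dtype=bool)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0)

# The test size is rounded up; the training set gets the remainder.
expected_test = math.ceil(n_rows * 0.3)
expected_train = n_rows - expected_test

print(x_tr.shape, x_te.shape)
print(expected_train, expected_test)
```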


## Creating the model
In this tutorial we will use a Logistic Regression model from the sklearn module
to perform the predictions. Logistic regression is a predictive analysis technique. It is used to describe data and explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. With Python and scikit-learn it is very easy to swap in other models if you wish for further
practice. Decision trees, random forests, SVMs, or gradient boosting can be used.
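
As a rough illustration of how easy the swap is, the sketch below fits three scikit-learn classifiers through the same `fit` and `score` calls. It uses synthetic data from `make_classification` as a stand-in for the weather data, so the scores it prints say nothing about the rain dataset; the hyperparameters are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather data: 3 features, binary target.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Every classifier exposes the same fit / score interface.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In the notebook, the equivalent change is to replace the `LogisticRegression()` line with another classifier and leave the remaining cells untouched.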

In [17]:
# Create the model
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

In [18]:
# fit the data to the model

lr.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [19]:
# Check the accuracy of the model on the training data
lr.score(x_train, y_train)

0.8612648884415367

In [20]:
# Run a prediction
prediction = lr.predict(x_test)

In [21]:
# Get an accuracy score on the test data, which the model has not seen during training
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))

0.855968688845
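
For a classifier, `score` and `accuracy_score` report the same quantity: the fraction of predictions that match the true labels. The sketch below confirms this on synthetic data, which is a stand-in and not the weather dataset, and also computes the fraction by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data used only for this comparison.
X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
predicted = model.predict(X_test)

via_score = model.score(X_test, y_test)          # predicts internally
via_metric = accuracy_score(y_test, predicted)   # uses stored predictions
by_hand = np.mean(predicted == y_test)           # fraction of matches

print(via_score, via_metric, by_hand)
```

Comparing the score on the training set with the score on the test set, as the cells above do, is a quick check for overfitting: a training score far above the test score would be a warning sign.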


In [22]:
from sklearn import metrics

# Confusion Matrix 
In order to show the predictions of the model we will use a confusion matrix. The confusion matrix for this
model will be a 2x2 array, as this is a binary classification.

In [23]:
cnf_matrix = metrics.confusion_matrix(y_test, prediction)
cnf_matrix 

array([[4099,  294],
       [ 810, 2462]], dtype=int64)

The way to read this: the diagonal from top left to bottom right holds the correct predictions, and the diagonal from top right to bottom left holds the incorrect predictions. Rows correspond to the actual values and columns to the predicted values.

In case anyone is confused :-) as to where the values in the matrix array are coming from:
4099 + 294 + 810 + 2462 = 7665 = the test set size. The array above shows 4099 correct predictions of No Rain, 294 No Rain days incorrectly predicted as Rain, 2462 correct predictions of Rain, and 810 Rain days incorrectly predicted as No Rain.

In [24]:
print('No Rain accurate predictions : ', cnf_matrix[0,0])
print('Correct rain predictions : ', cnf_matrix[1,1])

No Rain accurate predictions :  4099
Correct rain predictions :  2462
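
The layout of the matrix is easier to see on a handful of hand-written labels. In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, with `False` before `True`. The labels below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Seven made-up days: actual outcome vs. what a model predicted.
y_true = [False, False, False, True, True, True, True]
y_pred = [False, False, True, True, True, False, True]

cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() gives the four cells in reading order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

print(cm)
print("correct No Rain:", tn, "| No Rain predicted as Rain:", fp)
print("Rain predicted as No Rain:", fn, "| correct Rain:", tp)
```

Here two of the three No Rain days and three of the four Rain days are predicted correctly, so the diagonal holds 2 and 3.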


In [25]:
result = metrics.accuracy_score(y_test, prediction) *100
print("Accuracy = {0:.2f}%".format(result))

Accuracy = 85.60%


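The accuracy of this model is 85.60%. As a rough rule of thumb, anything over 80% is often considered good accuracy, although what counts as good depends on the problem and on how balanced the classes are.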
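
To put the 85.60% in context, the sketch below recomputes the accuracy from the confusion matrix reported above and compares it with a simple baseline that always predicts the more common class. The baseline is not part of the original notebook; it is added here only as a reference point.

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted,
# in the order [No Rain, Rain].
reported = np.array([[4099, 294],
                     [810, 2462]])

total = reported.sum()

# Correct predictions sit on the diagonal.
accuracy = np.trace(reported) / total

# Row sums give the number of actual No Rain and Rain days in the test set.
actual_counts = reported.sum(axis=1)

# Accuracy of always predicting the more common class.
baseline = actual_counts.max() / total

print(f"Accuracy = {accuracy:.2%}")
print(f"Majority-class baseline = {baseline:.2%}")
```

The model scores well above the baseline, which is one way to check that the accuracy is not just a reflection of one class being more common.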

# CONCLUSION:
In this Jupyter notebook we learned how to:
    - read in data from a CSV file into a pandas dataframe
    - split the data into x (the features) and y (the target column)
    - process the data, i.e. remove the null values and apply scaling
    - split the data to create a training and test set
    - create a model and fit the data to it, here we used Logistic Regression
    - get predictions based on the test set
    - create a confusion matrix to interpret the results and make sense of them
    - use metrics, i.e. accuracy, to show how the model is performing

 With the way I created this tutorial anyone should be able to swap the model for any
 other classification model without too many issues. All the information you would need is available from scikit-learn here:-  https://scikit-learn.org/stable/index.html