
This notebook walks through the basic steps of a machine learning project.

## Loading Data
In this section we import the necessary packages and load the dataset we plan to work on, the
<a href='https://github.com/WomenWhoCode/WWCodeDataScience/blob/master/Intro_to_MachineLearning/data/PlayGolf.csv'>Play Golf data</a>.

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [0]:
# Note: pd.read_csv needs the link to the raw file (raw.githubusercontent.com), not the GitHub page link
playgolf = pd.read_csv("https://raw.githubusercontent.com/WomenWhoCode/WWCodeDataScience/master/Intro_to_MachineLearning/data/PlayGolf.csv")
playgolf

Unnamed: 0,outlook,temperature,humidity,windy,play
0,overcast,cool,normal,True,yes
1,overcast,hot,high,False,yes
2,overcast,hot,normal,False,yes
3,overcast,mild,high,True,yes
4,rainy,cool,normal,False,yes
5,rainy,cool,normal,True,no
6,rainy,mild,high,False,yes
7,rainy,mild,high,True,no
8,rainy,mild,normal,False,yes
9,sunny,cool,normal,False,yes


Before analysing the data, set aside part of it to test your model on. Do not peek at this held-out data, not even to understand its distribution; otherwise the test score no longer tells you how the model behaves on data it has never seen.

In [0]:
y = playgolf["play"]              # target response
X = playgolf.drop(columns="play")  # input features

# shuffle=False keeps the original row order, so the last 20% of rows become the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print("Train-Test split complete!")
print(f"- X_train = {X_train.shape} | {X_train.columns.tolist()}")
print(f"- y_train = {y_train.shape}")
print(f"- X_test = {X_test.shape} | {X_test.columns.tolist()}")
print(f"- y_test = {y_test.shape}")

Train-Test split complete!
- X_train = (11, 4) | ['outlook', 'temperature', 'humidity', 'windy']
- y_train = (11,)
- X_test = (3, 4) | ['outlook', 'temperature', 'humidity', 'windy']
- y_test = (3,)


From here on, we are only allowed to analyze the training data (`X_train` and `y_train`).

In [0]:
X_train

Unnamed: 0,outlook,temperature,humidity,windy
0,overcast,cool,normal,True
1,overcast,hot,high,False
2,overcast,hot,normal,False
3,overcast,mild,high,True
4,rainy,cool,normal,False
5,rainy,cool,normal,True
6,rainy,mild,high,False
7,rainy,mild,high,True
8,rainy,mild,normal,False
9,sunny,cool,normal,False


In [0]:
y_train

0     yes
1     yes
2     yes
3     yes
4     yes
5      no
6     yes
7      no
8     yes
9     yes
10     no
Name: play, dtype: object

In [0]:
train = X_train.copy()
train["play"] = y_train
train

Unnamed: 0,outlook,temperature,humidity,windy,play
0,overcast,cool,normal,True,yes
1,overcast,hot,high,False,yes
2,overcast,hot,normal,False,yes
3,overcast,mild,high,True,yes
4,rainy,cool,normal,False,yes
5,rainy,cool,normal,True,no
6,rainy,mild,high,False,yes
7,rainy,mild,high,True,no
8,rainy,mild,normal,False,yes
9,sunny,cool,normal,False,yes
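
A first look at the training data could be as simple as counting the target classes and cross-tabulating one feature against the target. The sketch below is only an illustration: it rebuilds a small frame, here called `train_sample` (a hypothetical name), from the ten rows displayed above so that it runs without the download. In the notebook itself you would apply the same two calls to `train`.

```python
import pandas as pd

# Assumption: only the ten rows displayed above, typed in by hand.
# The real training set has 11 rows; the last one is not shown in the output.
columns = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [
    ("overcast", "cool", "normal", True, "yes"),
    ("overcast", "hot", "high", False, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "cool", "normal", False, "yes"),
]
train_sample = pd.DataFrame(rows, columns=columns)

# How balanced is the target?
class_counts = train_sample["play"].value_counts()
print(class_counts)

# How does each outlook value relate to the target?
outlook_vs_play = pd.crosstab(train_sample["outlook"], train_sample["play"])
print(outlook_vs_play)
```

Tables like these give a quick sense of which columns carry information about `play` before any model is fitted.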


#### What do we do next?
1. Analyze the dataset to see which columns are useful for the machine learning model.
   * Apply feature engineering and repeat step 1 until you are satisfied with the data.
2. Fit a machine learning model on the training data.
   * Evaluate it on the test data and compute the model's accuracy.
   * If the accuracy is not acceptable, change the parameters or the algorithm in step 2 and repeat until it is.
3. Train the final model on 100% of the data and use it for any incoming prediction requests.

![](https://github.com/WomenWhoCode/WWCodeDataScience/blob/master/Intro_to_MachineLearning/misc/ML_Project_Steps.png?raw=true)
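
The steps above can be sketched end to end as follows. This is a minimal illustration under several assumptions, not the notebook's prescribed method: the notebook does not name an algorithm, so `DecisionTreeClassifier` is an arbitrary choice; one-hot encoding stands in for "feature engineering"; and the data is limited to the ten rows displayed earlier, typed in by hand so the block runs offline. The names `data`, `model`, and `final_model` are hypothetical.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Assumption: only the ten rows displayed earlier in the notebook.
columns = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [
    ("overcast", "cool", "normal", True, "yes"),
    ("overcast", "hot", "high", False, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "cool", "normal", False, "yes"),
]
data = pd.DataFrame(rows, columns=columns)

y = data["play"]
# Treat every feature as a category; casting to str keeps the column types uniform.
X = data.drop(columns="play").astype(str)

# Same split settings as the notebook: last 20% of rows held out, no shuffling.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Step 1 (feature engineering) and step 2 (model) combined in one pipeline.
# The encoder is fitted on the training rows only; handle_unknown="ignore"
# lets it cope with categories that first appear in the test rows.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X_train, y_train)

# Step 2, continued: measure accuracy on the held-out rows.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")

# Step 3: refit an identical, unfitted copy of the pipeline on all of the data.
final_model = clone(model).fit(X, y)

# Serve a prediction for a new, hand-written example.
new_day = pd.DataFrame(
    [{"outlook": "sunny", "temperature": "cool", "humidity": "normal", "windy": "False"}]
)
prediction = final_model.predict(new_day)
print(prediction[0])
```

With so few rows the accuracy figure is not meaningful; the point is the order of operations: split, fit on the training rows, score on the held-out rows, then refit on everything.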