# Machine Learning

## What is Machine Learning?
This refers to the 

## Types of Machine Learning
- Unsupervised
- Supervised
- Reinforcement Learning

## Training Data
Training data consists of features and labels, recorded for many data points (rows). The features are the input columns, and the label is the value the model learns to predict for each row.
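As a minimal sketch of the feature/label distinction, assuming a small made-up table (the column names and values below are hypothetical, invented only for illustration):

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
houses = pd.DataFrame({
    "size_sqft": [850, 1200, 1500],
    "year_built": [1995, 2005, 2015],
    "sale_price": [300_000, 450_000, 600_000],
})

# Features: the input columns the model learns from.
X = houses[["size_sqft", "year_built"]]
# Label (target): the column the model should learn to predict.
y = houses["sale_price"]

print(X.shape)  # (3, 2): three data points, two features
print(y.shape)  # (3,): one label per data point
```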

### Back to the types:
- Supervised learning: the training data is labeled (features and labels)
- Unsupervised learning: the training data has features only, with no labels
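One way to see the difference in code is in what gets passed to `fit`. The sketch below assumes scikit-learn and uses hypothetical toy data; `LinearRegression` and `KMeans` are just convenient examples of each type, not the only options.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature, four data points.
X = np.array([[1.0], [2.0], [9.0], [10.0]])
y = np.array([2.0, 4.0, 18.0, 20.0])  # labels, used only by the supervised model

# Supervised: fit() receives both the features and the labels.
supervised = LinearRegression().fit(X, y)

# Unsupervised: fit() receives the features only and looks for structure.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(supervised.predict([[5.0]]))  # a predicted label for an unseen point
print(unsupervised.labels_)         # cluster assignments found from X alone
```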

## Machine Learning Workflow
In short, you take historical data and use it to train a machine learning model.

## Our Scenario
A dataset of NYC property sales from 2015 to 2019, which includes house size (sq ft), neighbourhood, year built, sale price, etc.

We want to predict the prices of houses.


### Step 1:
- **Extract Features**: Decide which features (columns) the model needs
<div style="color:red; font-size:small">JSYK: The correlation coefficient ranges from -1 to 1 and measures the strength and direction of the linear relationship between two variables. A positive value indicates a positive correlation, a negative value indicates a negative correlation, and a value near 0 indicates little or no linear relationship. A value of 0.8 or higher indicates a strongly positive correlation.</div>
- **Split Dataset**: Separate the features (independent variables) from the target, then hold out part of the data for testing
- **Train Data**: Feed the training data to an algorithm (Data => Algorithm)
- **Train Model**
- **Evaluate**: After the model makes predictions, they have to be evaluated. You can check the error rate by comparing the predictions to the real results from the test dataset.
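The correlation check mentioned under Extract Features can be sketched with pandas `DataFrame.corr()`. The data and column names here are hypothetical. Filtering on the absolute value, so that strongly negative features are also kept, is an assumption of this sketch, and the 0.8 cut-off is a rule of thumb rather than a fixed standard.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "size_sqft": [800, 1000, 1200, 1500, 2000],
    "age_years": [50, 40, 30, 20, 10],
    "sale_price": [200_000, 250_000, 300_000, 375_000, 500_000],
})

# Pearson correlation of every feature column with the target column.
correlations = df.corr()["sale_price"].drop("sale_price")
print(correlations)

# Keep features whose absolute correlation with the target is at least 0.8.
selected = correlations[correlations.abs() >= 0.8].index.tolist()
print(selected)
```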


and More...

In [3]:
import pandas as pd

In [4]:
data = pd.read_csv('50_Startups.csv')

In [5]:
data

|  | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
| 5 | 131876.9 | 99814.71 | 362861.36 | New York | 156991.12 |
| 6 | 134615.46 | 147198.87 | 127716.82 | California | 156122.51 |
| 7 | 130298.13 | 145530.06 | 323876.68 | Florida | 155752.6 |
| 8 | 120542.52 | 148718.95 | 311613.29 | New York | 152211.77 |
| 9 | 123334.88 | 108679.17 | 304981.62 | California | 149759.96 |


In [6]:
data.isnull().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0
dtype: int64

In [7]:
features = data.drop("Profit", axis=1)

In [8]:
features

|  | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York |
| 1 | 162597.7 | 151377.59 | 443898.53 | California |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida |
| 5 | 131876.9 | 99814.71 | 362861.36 | New York |
| 6 | 134615.46 | 147198.87 | 127716.82 | California |
| 7 | 130298.13 | 145530.06 | 323876.68 | Florida |
| 8 | 120542.52 | 148718.95 | 311613.29 | New York |
| 9 | 123334.88 | 108679.17 | 304981.62 | California |


In [9]:
target = data["Profit"]

In [10]:
target

0     192261.83
1     191792.06
2     191050.39
3     182901.99
4     166187.94
5     156991.12
6     156122.51
7     155752.60
8     152211.77
9     149759.96
10    146121.95
11    144259.40
12    141585.52
13    134307.35
14    132602.65
15    129917.04
16    126992.93
17    125370.37
18    124266.90
19    122776.86
20    118474.03
21    111313.02
22    110352.25
23    108733.99
24    108552.04
25    107404.34
26    105733.54
27    105008.31
28    103282.38
29    101004.64
30     99937.59
31     97483.56
32     97427.84
33     96778.92
34     96712.80
35     96479.51
36     90708.19
37     89949.14
38     81229.06
39     81005.76
40     78239.91
41     77798.83
42     71498.49
43     69758.98
44     65200.33
45     64926.08
46     49490.75
47     42559.73
48     35673.41
49     14681.40
Name: Profit, dtype: float64

In [11]:
from sklearn.preprocessing import LabelEncoder

In [12]:
le = LabelEncoder()

In [13]:
features["State"] = le.fit_transform(features["State"])

In [14]:
features

|  | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 2 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 1 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 2 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 1 |
| 5 | 131876.9 | 99814.71 | 362861.36 | 2 |
| 6 | 134615.46 | 147198.87 | 127716.82 | 0 |
| 7 | 130298.13 | 145530.06 | 323876.68 | 1 |
| 8 | 120542.52 | 148718.95 | 311613.29 | 2 |
| 9 | 123334.88 | 108679.17 | 304981.62 | 0 |
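`LabelEncoder` assigns the codes in alphabetical order (California = 0, Florida = 1, New York = 2). A linear model may read those integers as an ordering, which states do not have. One common alternative, shown here only as a sketch and not as what this notebook does, is one-hot encoding with `pd.get_dummies`. The state names come from the dataset; the four rows are a toy subset.

```python
import pandas as pd

# The three state names come from the dataset; the rows are a toy subset.
states = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# One 0/1 column per state instead of a single integer code.
encoded = pd.get_dummies(states, columns=["State"], dtype=int)
print(encoded)
```

With this encoding each state gets its own column, so no state is treated as "larger" than another.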


In [15]:
from sklearn.model_selection import train_test_split

### Using train_test_split yields:
- X_train: Features (independent variables) used to train the model (80% of the data in our case)
- y_train: Target values for the training data
- X_test: Features held out to test the model
- y_test: Target values ("Profit") for the held-out test data

### Next, we split both the features and the target so that part of the data is held out for testing.

In [16]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)

In [23]:
features_test.shape

(10, 4)

In [22]:
target_train.shape

(40,)
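The shapes above follow from `test_size=0.2` applied to 50 rows: 10 rows go to the test set and 40 to the training set. The sketch below reproduces that on stand-in data of the same shape. The `random_state` value is an arbitrary assumption added here so that the shuffle is repeatable; the notebook's own call does not set it, so its split changes from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the notebook's: 50 rows, 4 features.
X = np.arange(200).reshape(50, 4)
y = np.arange(50)

# test_size=0.2 holds out 20% of the rows; random_state (an arbitrary
# value here) makes the shuffle repeatable between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (40, 4) (10, 4)
print(y_train.shape, y_test.shape)  # (40,) (10,)
```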

In [25]:
from sklearn.linear_model import LinearRegression

In [26]:
lm = LinearRegression()  # lm = linear model

In [29]:
model = lm.fit(X=features_train, y=target_train)

In [30]:
model

LinearRegression()
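A fitted `LinearRegression` stores one weight per feature in `coef_` and a constant term in `intercept_`. As a rough illustration, the sketch below fits hypothetical data generated from a known rule, so the learned values can be compared with the rule itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from a known rule: y = 3*x1 + 2*x2 + 5.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

model = LinearRegression().fit(X, y)

# One learned weight per feature, plus a constant term.
print(model.coef_)       # approximately [3. 2.]
print(model.intercept_)  # approximately 5.0
```

On the notebook's data, the same two attributes on `model` would show how much each spend column contributes to the predicted profit.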

In [31]:
features_test

|  | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 40 | 28754.33 | 118546.05 | 172795.67 | 0 |
| 49 | 0.0 | 116983.8 | 45173.06 | 0 |
| 43 | 15505.73 | 127382.3 | 35534.17 | 2 |
| 13 | 91992.39 | 135495.07 | 252664.93 | 0 |
| 27 | 72107.6 | 127864.55 | 353183.81 | 2 |
| 39 | 38558.51 | 82982.09 | 174999.3 | 0 |
| 33 | 55493.95 | 103057.49 | 214634.81 | 1 |
| 22 | 73994.56 | 122782.75 | 303319.26 | 1 |
| 7 | 130298.13 | 145530.06 | 323876.68 | 1 |
| 32 | 63408.86 | 129219.61 | 46085.25 | 0 |


In [41]:
prediction = model.predict(features_test)

In [42]:
result = pd.DataFrame({"Actual Value":target_test, "Predicted Value": prediction, "Error Margin": abs(target_test - prediction)})

In [43]:
result

|  | Actual Value | Predicted Value | Error Margin |
|---|---|---|---|
| 40 | 78239.91 | 78184.211169 | 55.698831 |
| 49 | 14681.4 | 51865.356846 | 37183.956846 |
| 43 | 69758.98 | 61216.233688 | 8542.746312 |
| 13 | 134307.35 | 129356.461116 | 4950.888884 |
| 27 | 105008.31 | 115201.141217 | 10192.831217 |
| 39 | 81005.76 | 86656.418688 | 5650.658688 |
| 33 | 96778.92 | 99529.662241 | 2750.742241 |
| 22 | 110352.25 | 116249.714367 | 5897.464367 |
| 7 | 155752.6 | 160025.228381 | 4272.628381 |
| 32 | 97427.84 | 100758.394935 | 3330.554935 |
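The per-row error margins can be summarised into single scores. The sketch below assumes the functions in `sklearn.metrics` and copies the actual and predicted values from the result table above. With only ten test rows, the scores should be read as a rough indication rather than a firm measure of model quality.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Actual and predicted values copied from the result table above.
actual = np.array([78239.91, 14681.40, 69758.98, 134307.35, 105008.31,
                   81005.76, 96778.92, 110352.25, 155752.60, 97427.84])
predicted = np.array([78184.211169, 51865.356846, 61216.233688, 129356.461116,
                      115201.141217, 86656.418688, 99529.662241, 116249.714367,
                      160025.228381, 100758.394935])

mae = mean_absolute_error(actual, predicted)           # average absolute error
rmse = np.sqrt(mean_squared_error(actual, predicted))  # penalises large misses more
r2 = r2_score(actual, predicted)                       # share of variance explained

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```

The MAE is the mean of the "Error Margin" column, so it is heavily influenced by the single large miss on row 49.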
