## Preparing and training our model on multiple factors or variables

In [1]:
import pandas as pd

In [29]:
dataset=pd.read_csv('50_Startups.csv')

In [30]:
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [31]:
y=dataset.iloc[:,-1] # target: the last column (Profit)
X=dataset.iloc[:,:-1].copy() # features: every column except the last; copy so later edits do not touch dataset
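
Positional slicing with `iloc` only works while the target stays in the last column. A minimal sketch of the same split by column name, built from the five rows shown in the preview above (the names `demo`, `X_demo` and `y_demo` are made up for this illustration and are not part of the notebook):

```python
import pandas as pd

# First five rows from the preview above; `demo` is a hypothetical name.
demo = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# Select the target by name, and keep everything else as features.
y_demo = demo["Profit"]
X_demo = demo.drop(columns="Profit")

print(X_demo.shape, y_demo.shape)
```

Selecting by name keeps working if the column order in the CSV changes, which the positional version does not guarantee.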

In [32]:
X.shape
X.head()

| | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York |
| 1 | 162597.7 | 151377.59 | 443898.53 | California |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida |


In [33]:
y.shape
y.head()

0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64

### Most machine learning models can process only numeric values, so a feature stored as strings (such as State) causes an error during training. Hence we need to convert the string data to numeric form first.
## This conversion is one step of DATA PREPROCESSING.

In [34]:
from sklearn.preprocessing import LabelEncoder

### LabelEncoder converts string values to integers.
### It assigns a unique integer to each distinct string in the column, numbering the strings in sorted order.

In [35]:
encode_x=LabelEncoder()

states=X.iloc[:,-1] # the last column of X holds the states
new_states=encode_x.fit_transform(states)

In [36]:
X.iloc[:,-1]=new_states

In [37]:
X.head()

| | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 2 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 1 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 2 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 1 |
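
To see which integer stands for which state, the fitted encoder can be inspected. The sketch below repeats the encoding on the five state values from the preview; `states_demo`, `encoder` and `mapping` are illustrative names, not variables from the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The State values of the first five rows.
states_demo = ["New York", "California", "Florida", "New York", "Florida"]

encoder = LabelEncoder()
codes = encoder.fit_transform(states_demo)

# classes_ holds the distinct labels in sorted order; a label's position is its code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform turns the integer codes back into the original strings.
original = encoder.inverse_transform(codes)
print(list(original))
```

The codes follow alphabetical order, which is why California becomes 0 and New York becomes 2 in the table above.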


##### Label encoding is a poor fit for a feature like State, whose categories have no natural order. The integers suggest an ordering (California < Florida < New York) that does not exist, and a category encoded as 0 contributes nothing to a linear model: whatever weight it is multiplied by, the result is zero, so that category loses its importance.

##### To avoid this problem we separate the categorical feature into one column per category and fill those columns with binary values: 1 if the row belongs to that category, 0 otherwise.

## This type of encoding is called ONE HOT ENCODING.

###### These new columns (features) are called dummy features or dummy variables.
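
As a rough illustration of the idea, independent of scikit-learn, the dummy columns can be built by comparing each state against the list of distinct states. This is only a sketch on the five preview rows, with made-up variable names:

```python
import numpy as np

states_demo = np.array(["New York", "California", "Florida", "New York", "Florida"])

# Distinct categories in sorted order: California, Florida, New York.
categories = np.unique(states_demo)

# Compare every row against every category; True becomes 1, False becomes 0.
one_hot = (states_demo[:, None] == categories[None, :]).astype(int)

print(categories.tolist())
print(one_hot)
```

Each row contains exactly one 1, in the column of the state that row belongs to.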

In [51]:
from sklearn.preprocessing import OneHotEncoder

In [2]:
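# -1 selects the last column (State), whose categories we want to
# expand into 0/1 dummy columns using the one hot encoder.
# Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22.
onehotencoding=OneHotEncoder(categorical_features=[-1])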

In [53]:
X_categorical=onehotencoding.fit_transform(X)

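Note from the scikit-learn warning printed by this cell: if a LabelEncoder was used before the OneHotEncoder only to convert the categories to integers, that step is no longer needed, because OneHotEncoder can now be applied to the string column directly.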


In [54]:
X=X_categorical.toarray()
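
In scikit-learn releases that no longer accept `categorical_features`, one way to get the same layout (dummy columns first, numeric columns after) is a `ColumnTransformer`. The following is a sketch under that assumption, run on three preview rows with a single numeric column; `X_demo`, `transformer` and `encoded` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three preview rows, one numeric column plus the string State column.
X_demo = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51],
    "State": ["New York", "California", "Florida"],
})

# One-hot encode State and pass the remaining columns through unchanged.
transformer = ColumnTransformer(
    transformers=[("state", OneHotEncoder(), ["State"])],
    remainder="passthrough",
)
encoded = transformer.fit_transform(X_demo)

# The result may be sparse or dense depending on its density; make it dense either way.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()
encoded = np.asarray(encoded, dtype=float)

print(encoded)
```

The first three columns are the California, Florida and New York dummies, followed by R&D Spend. No LabelEncoder step is needed here because the encoder works on the strings directly.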

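##### One hot encoding leads to the dummy variable trap: the dummy columns of a feature always sum to 1, so each one can be computed from the others and together they are collinear with the model intercept.
##### To avoid the dummy variable trap we remove one of the dummy columns.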
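
The redundancy can be checked numerically. In this sketch (NumPy only, using the dummy values of the five preview rows and hypothetical variable names), a column of ones stands in for the intercept. With all three dummy columns the matrix has four columns but only rank 3; after dropping one dummy column the three remaining columns are independent.

```python
import numpy as np

# Dummy columns (California, Florida, New York) for the five preview rows.
dummies = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
], dtype=float)
intercept = np.ones((5, 1))

# Intercept plus all three dummies: 4 columns, but one is redundant.
full = np.hstack([intercept, dummies])
# Intercept plus two dummies (first dummy dropped): 3 independent columns.
reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```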

In [58]:
type(X)

numpy.ndarray

In [69]:
# Drop the first dummy column to avoid the dummy variable trap.
X=X[:,1:]

In [79]:
X[0:6]

array([[0.0000000e+00, 1.0000000e+00, 1.6534920e+05, 1.3689780e+05,
        4.7178410e+05],
       [0.0000000e+00, 0.0000000e+00, 1.6259770e+05, 1.5137759e+05,
        4.4389853e+05],
       [1.0000000e+00, 0.0000000e+00, 1.5344151e+05, 1.0114555e+05,
        4.0793454e+05],
       [0.0000000e+00, 1.0000000e+00, 1.4437241e+05, 1.1867185e+05,
        3.8319962e+05],
       [1.0000000e+00, 0.0000000e+00, 1.4210734e+05, 9.1391770e+04,
        3.6616842e+05],
       [0.0000000e+00, 1.0000000e+00, 1.3187690e+05, 9.9814710e+04,
        3.6286136e+05]])

In [70]:
from sklearn.model_selection import train_test_split

In [71]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [72]:
X_train.shape

(40, 5)

In [73]:
X_test.shape

(10, 5)

In [74]:
from sklearn.linear_model import LinearRegression

In [75]:
model=LinearRegression()

In [76]:
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

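#### The learned weights (coefficients) of the features are stored in the coef_ attribute, in column order: Florida, New York, R&D Spend, Administration, Marketing Spend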

In [77]:
model.coef_

array([ 9.38793006e+02,  6.98775997e+00,  8.05630064e-01, -6.87878823e-02,
        2.98554429e-02])

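### What about the weight of the 3rd state?

#### That weight is absorbed into the bias. The bias is the model intercept, the prediction when every feature is zero. The state whose dummy column was dropped (California here, the rows where both dummy values are 0) acts as the baseline, and the two state coefficients measure the difference from that baseline.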

In [82]:
model.intercept_ #Bias

54028.039594058944
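
A small made-up example shows how the intercept takes over the dropped category. The profits below are hypothetical toy numbers, not values from the dataset, and the model uses only the two state dummies:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: two companies per state.
# Columns are the Florida and New York dummies; California is the dropped baseline.
X_toy = np.array([
    [0, 0], [0, 0],   # California
    [1, 0], [1, 0],   # Florida
    [0, 1], [0, 1],   # New York
], dtype=float)
y_toy = np.array([10.0, 12.0, 20.0, 22.0, 30.0, 34.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

# The intercept is the mean profit of the baseline state (California).
print(toy_model.intercept_)
# The coefficients are the Florida and New York means minus the baseline mean.
print(toy_model.coef_)
```

With only dummy features the fit reproduces the group means: the intercept is the California mean (11), and the coefficients are the Florida and New York means minus that value (10 and 21).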

In [88]:
y_pred=model.predict(X_test)

In [89]:
y_pred

array([126362.87908252,  84608.45383643,  99677.49425155,  46357.46068582,
       128750.48288497,  50912.41741905, 109741.350327  , 100643.24281644,
        97599.275746  , 113097.42524437])

In [84]:
y_test

13    134307.35
39     81005.76
30     99937.59
45     64926.08
17    125370.37
48     35673.41
26    105733.54
25    107404.34
32     97427.84
19    122776.86
Name: Profit, dtype: float64

In [85]:
from sklearn.metrics import mean_absolute_error

In [90]:
mean_absolute_error(y_test,y_pred)

6961.477813275563
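
The mean absolute error is the average of |actual - predicted| over the test rows. As a check, the sketch below recomputes it with NumPy from the `y_test` and `y_pred` values printed above (the `_values` names are introduced only for this illustration):

```python
import numpy as np

# Actual profits of the ten test rows, as printed above.
y_test_values = np.array([
    134307.35, 81005.76, 99937.59, 64926.08, 125370.37,
    35673.41, 105733.54, 107404.34, 97427.84, 122776.86,
])
# Model predictions for the same rows, as printed above.
y_pred_values = np.array([
    126362.87908252, 84608.45383643, 99677.49425155, 46357.46068582,
    128750.48288497, 50912.41741905, 109741.350327, 100643.24281644,
    97599.275746, 113097.42524437,
])

# Average of the absolute differences.
mae = np.mean(np.abs(y_test_values - y_pred_values))
print(mae)
```

The result agrees with the value returned by `mean_absolute_error`, about 6961.48, meaning the predictions are off by roughly that amount of profit on average.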