# What is a Decision Tree?

 - A decision tree is simply a set of cascading questions.
 - Given a data point (i.e. a set of features and their values), you use each attribute (i.e. the value of a given feature of the data point) to answer a question.
 - The answer to each question decides the next question.
 - At the end of this sequence of questions, you end up with a probability of the data point belonging to each class.
 - A decision tree is a supervised learning algorithm (it needs a predefined target variable) that is mostly used for classification problems.
 - It works for both categorical and continuous input and output variables.
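
As a minimal sketch, the cascade of questions can be written as nested conditionals. The record fields, the 5.5 ft threshold, and the leaf probabilities below are made up for illustration; a real tree learns its questions and probabilities from data.

```python
def predict_proba(student):
    """Walk a tiny hand-written tree and return class probabilities.

    Every question, threshold, and leaf probability here is hypothetical.
    """
    # Question 1: the answer decides which question is asked next.
    if student["gender"] == "girl":
        # Question 2a: asked only for girls.
        if student["height_ft"] < 5.5:
            return {"plays": 0.10, "does_not_play": 0.90}
        return {"plays": 0.30, "does_not_play": 0.70}
    # Question 2b: asked only for boys.
    if student["class"] == "IX":
        return {"plays": 0.55, "does_not_play": 0.45}
    return {"plays": 0.75, "does_not_play": 0.25}


example = {"gender": "boy", "class": "X", "height_ft": 5.8}
probabilities = predict_proba(example)
print(probabilities)
```

Each `return` plays the role of a leaf: the sequence of answers determines which leaf the data point lands in, and the leaf supplies the class probabilities.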

<img src="https://www.analyticsvidhya.com/wp-content/uploads/2016/04/dt.png" height="400" width="400">

 - Let’s say we have a sample of 30 students with three variables:
     - Gender (Boy/Girl)
     - Class (IX/X)
     - Height (5 to 6 ft)
 - 15 of these 30 students play cricket in their leisure time.
 - We want to build a model that predicts who will play cricket during the leisure period.
 - To do that, we need to segregate the students who play cricket based on the most significant of the three input variables.
 - This is where a decision tree helps: it segregates the students on every value of the three variables and identifies the variable that creates the most homogeneous sets of students.
 
<img src = https://www.analyticsvidhya.com/wp-content/uploads/2015/01/Test.png>

 - The decision tree identifies the most significant variable, and the value of that variable, that gives the most homogeneous sets of the population.
 - The question that now arises is: how does it identify the variable and the split?

# Types of Decision Trees

The type of decision tree depends on the type of target variable. There are two types:

 - Categorical Variable Decision Tree
 - Continuous Variable Decision Tree

**Categorical Variable Decision Tree:** 
   - A decision tree with a categorical target variable is called a categorical variable decision tree (for example, predicting whether a student plays cricket: yes or no).

**Continuous Variable Decision Tree:** 
   - A decision tree with a continuous target variable is called a continuous variable decision tree (for example, predicting a company's profit).

# Important Terminology related to Tree based Algorithms

Let’s look at the basic terminology used with Decision trees:

**Root Node:** It represents the entire population or sample, which then gets divided into two or more homogeneous sets

**Splitting:** The process of dividing a node into two or more sub-nodes

**Decision Node:** When a sub-node splits into further sub-nodes, it is called a decision node

**Leaf / Terminal Node:** A node that does not split is called a leaf or terminal node

<img src="https://www.analyticsvidhya.com/wp-content/uploads/2015/01/Decision_Tree_2.png" height="400" width="400">

**Branch / Sub-Tree:** A subsection of the entire tree is called a branch or sub-tree

**Parent and Child Node:** A node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are its children

# Advantages

**Easy to Understand:**
 - Decision tree output is very easy to understand, even for people from a non-analytical background
 - It does not require any statistical knowledge to read and interpret

**Less data cleaning required:**
 - It requires less data cleaning than some other modeling techniques. To a fair degree, it is not influenced by outliers and missing values
 
**Data type is not a constraint:**
 - It can handle both numerical and categorical variables.

**Non Parametric Method:**
 - A decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the distribution of the data or the structure of the classifier.
 

# Disadvantages

**Over fitting:**
 - Over fitting is one of the most common practical difficulties with decision tree models. It is addressed by setting constraints on model parameters and by pruning
 
**Not fit for continuous variables:**
 - When working with continuous numerical variables, a decision tree loses information because it cuts each variable into discrete ranges

# How does a tree based algorithm decide where to split?

The choice of splits heavily affects a tree’s accuracy

Decision trees can use several algorithms to decide how to split a node into two or more sub-nodes. Each split is chosen to increase the homogeneity of the resulting sub-nodes, i.e. their purity with respect to the target variable

A decision tree evaluates splits on all available variables and then selects the split that results in the most homogeneous sub-nodes
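
The search described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how any particular library implements it: the homogeneity measure (sum of squared class proportions), the midpoint thresholds, and the toy data are all choices made for this example.

```python
def homogeneity(labels):
    """Sum of squared class proportions: 1.0 for a pure node."""
    if not labels:
        return 0.0
    n = len(labels)
    return sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Try every feature and threshold; keep the most homogeneous split.

    rows is a list of numeric feature lists, labels the matching classes.
    Returns (feature_index, threshold, weighted_score), or None if no
    split is possible.
    """
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            # Weight each side by its share of the samples.
            score = (len(left) / n) * homogeneity(left) \
                + (len(right) / n) * homogeneity(right)
            if best is None or score > best[2]:
                best = (feature, threshold, score)
    return best


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
labels = ["no", "no", "yes", "yes"]
split = best_split(rows, labels)
print(split)
```

A full tree builder would apply `best_split` recursively to the left and right subsets until a stopping rule is met.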

**The choice of algorithm also depends on the type of target variable. Let’s look at the most commonly used algorithms in decision trees:**

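## **Gini**

The Gini score used here is the probability that two items selected at random (with replacement) from a node belong to the same class; it is 1 for a pure node. It is the complement of the Gini impurity (impurity = 1 - score)

 - It works with a categorical target variable such as “Success” or “Failure”
 - It performs only binary splits
 - The higher the value of this score, the higher the homogeneity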

**Steps to Calculate Gini for a split**

In the student example, we split the population using two input variables, Gender and Class. Now we want to identify which split produces more homogeneous sub-nodes using Gini

<img src = https://www.analyticsvidhya.com/wp-content/uploads/2015/01/Decision_Tree_Algorithm1.png>

Calculate Gini for each sub-node as the sum of the squared probabilities of success and failure (p^2+q^2). Then calculate Gini for the split as the weighted average of the sub-node scores, weighting each sub-node by its share of the samples

**Split on Gender:**

 - Gini for sub-node Female = (0.2 × 0.2) + (0.8 × 0.8) = 0.68
 - Gini for sub-node Male = (0.65 × 0.65) + (0.35 × 0.35) = 0.55
 - Weighted Gini for split on Gender = (10/30) × 0.68 + (20/30) × 0.55 = 0.59

**Similarly, for the split on Class:**

 - Gini for sub-node Class IX = (0.43 × 0.43) + (0.57 × 0.57) = 0.51
 - Gini for sub-node Class X = (0.56 × 0.56) + (0.44 × 0.44) = 0.51
 - Weighted Gini for split on Class = (14/30) × 0.51 + (16/30) × 0.51 = 0.51

Above, you can see that the Gini score for the split on Gender is higher than for the split on Class, so the node will be split on Gender
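
The arithmetic above can be checked with a short script. The helper names `gini_score` and `weighted_gini` are made up for this sketch; the counts (2 of 10 girls, 13 of 20 boys, 6 of 14 in Class IX, 9 of 16 in Class X play cricket) come from the student example.

```python
def gini_score(successes, total):
    """Sum of squared class probabilities, p^2 + q^2, for a binary node."""
    p = successes / total
    q = 1 - p
    return p ** 2 + q ** 2


def weighted_gini(nodes):
    """Weighted average of node scores; nodes is a list of (successes, total)."""
    population = sum(total for _, total in nodes)
    return sum((total / population) * gini_score(s, total) for s, total in nodes)


# (students who play cricket, students in the node)
gender_split = [(2, 10), (13, 20)]   # Female, Male
class_split = [(6, 14), (9, 16)]     # Class IX, Class X

gender_score = weighted_gini(gender_split)
class_score = weighted_gini(class_split)
print(round(gender_score, 2), round(class_score, 2))  # 0.59 0.51
```

Note that scikit-learn's `criterion='gini'` works with the Gini impurity, which is 1 minus this score, so there a lower value means a purer node. The ranking of the two splits is the same either way.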

## Information Gain:

Look at the image below and think about which node can be described most easily

<img src = https://www.analyticsvidhya.com/wp-content/uploads/2015/01/Information_Gain_Decision_Tree2.png>

The answer is C, because it requires less information: all of its values are similar

On the other hand, B requires more information to describe it, and A requires the most

In other words, we can say that C is a pure node, B is less impure, and A is more impure.

We can conclude that a less impure node requires less information to describe it, and a more impure node requires more. Information theory defines a measure of this degree of disorganization in a system, known as entropy. For a two-class node, the entropy is 0 if the node is completely homogeneous and 1 if it is split 50/50

**Entropy can be calculated using formula:**

<img src = https://www.analyticsvidhya.com/wp-content/uploads/2015/01/Entropy_Formula.png>

Here p and q are the probabilities of success and failure, respectively, in that node, i.e. Entropy = -p log2 (p) - q log2 (q)

The tree chooses the split with the lowest entropy compared to the parent node and the other splits. The lower the entropy, the better.

#### **Steps to calculate entropy for a split:**

 - Calculate the entropy of the parent node
 - Calculate the entropy of each individual node of the split, then the weighted average over all sub-nodes in the split.

Example: Let’s use this method to identify the best split for the student example

<img src = https://www.analyticsvidhya.com/wp-content/uploads/2015/01/Decision_Tree_Algorithm1.png>

 - Entropy for parent node = -(15/30) log2 (15/30) - (15/30) log2 (15/30) = 1

Here 1 shows that it is an impure node

 - Entropy for Female node = -(2/10) log2 (2/10) - (8/10) log2 (8/10) = 0.72
 - Entropy for Male node = -(13/20) log2 (13/20) - (7/20) log2 (7/20) = 0.93
 - Entropy for split on Gender = weighted entropy of sub-nodes = (10/30) × 0.72 + (20/30) × 0.93 = 0.86

 - Entropy for Class IX node = -(6/14) log2 (6/14) - (8/14) log2 (8/14) = 0.99
 - Entropy for Class X node = -(9/16) log2 (9/16) - (7/16) log2 (7/16) = 0.99
 - Entropy for split on Class = (14/30) × 0.99 + (16/30) × 0.99 = 0.99

Above, you can see that the entropy for the split on Gender is the lowest, so the tree will split on Gender. Information gain is the entropy of the parent node minus the weighted entropy of the split; because the parent entropy here is 1, it equals 1 - Entropy
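
The same numbers can be reproduced in code. This is a small sketch with made-up helper names (`entropy`, `weighted_entropy`), using the counts from the student example; information gain is computed as parent entropy minus the weighted entropy of the split.

```python
from math import log2


def entropy(successes, total):
    """Binary entropy -p*log2(p) - q*log2(q) of a node."""
    p = successes / total
    # By convention 0 * log2(0) counts as 0, so empty classes are skipped.
    return -sum(x * log2(x) for x in (p, 1 - p) if x > 0)


def weighted_entropy(nodes):
    """Weighted average entropy; nodes is a list of (successes, total)."""
    population = sum(total for _, total in nodes)
    return sum((total / population) * entropy(s, total) for s, total in nodes)


parent = entropy(15, 30)                              # 15 of 30 students play
gender = weighted_entropy([(2, 10), (13, 20)])        # Female, Male
by_class = weighted_entropy([(6, 14), (9, 16)])       # Class IX, Class X

gain_gender = parent - gender
gain_class = parent - by_class
print(round(gender, 2), round(by_class, 2))           # 0.86 0.99
print(round(gain_gender, 2), round(gain_class, 2))    # 0.14 0.01
```

The split on Gender has the larger information gain, which matches the conclusion reached with the Gini score.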

# Are tree based algorithms better than linear models?

“If I can use logistic regression for classification problems and linear regression for regression problems, why is there a need to use trees?” Many of us have this question, and it is a valid one.

Actually, you can use any algorithm; the choice depends on the type of problem you are solving. Let’s look at some key factors that will help you decide which algorithm to use:

- If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will usually outperform a tree based model
- If the relationship between the dependent and independent variables is highly non-linear and complex, a tree model will usually outperform a classical regression method
- If you need to build a model that is easy to explain to people, a decision tree will usually do better than a linear model. A small decision tree can be even simpler to interpret than linear regression!

In [3]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

In [8]:
df = pd.read_csv(r"/Users/shreyansengupta/Downloads/Py/diabetes.csv")
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [9]:
len(df)

768

In [10]:
df1 = df.copy()

In [11]:
len(df1)

768

In [8]:
df1.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [10]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [11]:
df1.count()

Pregnancies                 768
Glucose                     768
BloodPressure               768
SkinThickness               768
Insulin                     768
BMI                         768
DiabetesPedigreeFunction    768
Age                         768
Outcome                     768
dtype: int64

In [12]:
df1.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [17]:
df1.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [13]:
df1[df1['Glucose'] == 0]
len(df1[df1['Glucose'] == 0])

5

In [14]:
df1[df1['BloodPressure'] == 0]
len(df1[df1['BloodPressure'] == 0])

35

In [15]:
df1[df1['SkinThickness'] == 0]
len(df1[df1['SkinThickness'] == 0])

227

In [16]:
df1[df1['BMI'] == 0]
len(df1[df1['BMI'] == 0])

11

In [17]:
df1[df1['Insulin'] == 0]
len(df1[df1['Insulin'] == 0])

374

In [18]:
zero_columns = ['Glucose','BloodPressure', 'SkinThickness', 'BMI', 'Insulin']

In [19]:
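# Replace zero placeholders with the column mean.
# Note: the mean is computed over all rows, zeros included.
for column in zero_columns:
    mean = df1[column].mean(skipna = True)
    df1[column] = df1[column].replace(0, mean)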
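
One thing to keep in mind about the loop above: the column mean is computed over all rows, so the zero placeholders pull the fill value down (`skipna` only skips NaN, not zeros). If that is not intended, one alternative is to mark the zeros as missing first. The sketch below uses a tiny made-up frame called `demo`, not the diabetes data.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for one column of the diabetes data.
demo = pd.DataFrame({"Insulin": [0, 94, 168, 0, 88]})

# Mean including the zero placeholders (what the loop above computes).
mean_with_zeros = demo["Insulin"].mean()

# Alternative: turn zeros into NaN so the mean uses real readings only,
# then fill the gaps with that mean.
as_missing = demo["Insulin"].replace(0, np.nan)
mean_without_zeros = as_missing.mean()
imputed = as_missing.fillna(mean_without_zeros)

print(mean_with_zeros, round(mean_without_zeros, 2))
print(imputed.round(2).tolist())
```

Which fill value is more appropriate depends on the data; the median is another common choice for skewed columns such as Insulin.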

In [20]:
df1.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [21]:
len(df1.columns)

9

In [22]:
df1.iloc[:, 0:-1]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148.0,72.000000,35.000000,79.799479,33.600000,0.627,50
1,1,85.0,66.000000,29.000000,79.799479,26.600000,0.351,31
2,8,183.0,64.000000,20.536458,79.799479,23.300000,0.672,32
3,1,89.0,66.000000,23.000000,94.000000,28.100000,0.167,21
4,0,137.0,40.000000,35.000000,168.000000,43.100000,2.288,33
5,5,116.0,74.000000,20.536458,79.799479,25.600000,0.201,30
6,3,78.0,50.000000,32.000000,88.000000,31.000000,0.248,26
7,10,115.0,69.105469,20.536458,79.799479,35.300000,0.134,29
8,2,197.0,70.000000,45.000000,543.000000,30.500000,0.158,53
9,8,125.0,96.000000,20.536458,79.799479,31.992578,0.232,54


In [23]:
X=df1[df1.columns[0:-1]]
Y=df1[df1.columns[-1]]

In [26]:
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.20)

In [27]:
DecisionTreeClassifier?

In [28]:
tree = DecisionTreeClassifier()
tree

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [29]:
tree.fit(X_train,y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train,y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test,y_test)))

Accuracy on training set: 1.000
Accuracy on test set: 0.734
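
A training accuracy of 1.000 next to a much lower test accuracy is the over fitting pattern mentioned earlier: an unconstrained tree keeps splitting until every training leaf is pure. The sketch below shows how constraints can be set; it runs on synthetic data from `make_classification` (an assumption for this example, not the diabetes data), and the limits `max_depth=3` and `min_samples_leaf=10` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes dataset used above).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Unconstrained tree: grows until every training leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)

# Constrained tree: depth and leaf-size limits chosen only for illustration.
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(Xd_train, yd_train)

for name, model in [("full", full_tree), ("constrained", small_tree)]:
    print(
        name,
        "depth:", model.tree_.max_depth,
        "train: {:.3f}".format(model.score(Xd_train, yd_train)),
        "test: {:.3f}".format(model.score(Xd_test, yd_test)),
    )
```

Comparing the train and test scores of the two models shows how much of the training accuracy is memorization. Good values for the constraints are usually found with cross-validation; newer scikit-learn releases also offer cost-complexity pruning through the `ccp_alpha` parameter.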


In [30]:
predictions = tree.predict(X_test)

In [31]:
confusion_matrix(y_test, predictions)

array([[82, 23],
       [18, 31]], dtype=int64)

In [34]:
(82+31)/(82+31+23+18)

0.7337662337662337

In [35]:
accuracy_score(y_test, predictions)

0.7337662337662337
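
To make the link between the confusion matrix and the accuracy explicit, the sketch below recomputes accuracy by hand from the matrix shown above, along with precision and recall for class 1. It assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the cell above: rows = true class, columns = predicted.
matrix = [[82, 23],
          [18, 31]]
(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # correct predictions / all predictions
precision = tp / (tp + fp)     # of the predicted positives, how many were right
recall = tp / (tp + fn)        # of the actual positives, how many were found

print(total, round(accuracy, 4), round(precision, 4), round(recall, 4))
```

With 105 negatives and only 49 positives in the test set, precision and recall for the positive class say more about the model than accuracy alone.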

In [36]:
X.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')

In [37]:
df1['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [38]:
tree.predict([[1,2,3,4,5,6,7,8],[10,20,30,40,55,4,32,1]])

array([0, 1], dtype=int64)

# What is Random Forest? How does it work?

Random Forest is often described as a panacea for data science problems. On a lighter note: when you can’t think of any algorithm (irrespective of the situation), try random forest!

Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It can also support dimensionality reduction, and some implementations handle missing values and outliers reasonably well, which covers several essential steps of data exploration. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

### How does it work?

A Random Forest grows multiple trees

Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification with the most votes. For regression, it takes the average of the outputs of the individual trees
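
The voting step can be sketched as follows. The function name `forest_vote` and the five votes are hypothetical; the point is only to show the majority rule.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class with the most votes, plus the full tally."""
    tally = Counter(tree_predictions)
    winner, _ = tally.most_common(1)[0]
    return winner, tally


# Hypothetical votes from five trees for one data point (1 = diabetic, 0 = not).
votes = [1, 0, 1, 1, 0]
winner, tally = forest_vote(votes)
print(winner, dict(tally))
```

In practice, scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees and picks the class with the highest mean probability, rather than counting hard votes; the two rules often agree but are not identical.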

**Advantages of Random Forest**

One of the benefits of Random Forest that excites me most is its power to handle large data sets with high dimensionality

It can handle thousands of input variables and identify the most significant ones, so it is considered one of the dimensionality reduction methods

Further, the model outputs the importance of each variable, which can be a very handy feature (shown below on some random data set).

<img src = https://www.analyticsvidhya.com/wp-content/uploads/2015/09/Variable_Important.png>

In [39]:
from sklearn.ensemble import RandomForestClassifier

In [40]:
df = pd.read_csv(r"C:\Users\Cyntexia\Downloads\New folder\archive\diabetes.csv")
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [41]:
len(df)

768

In [42]:
df1 = df.copy()

In [43]:
len(df1)

768

In [44]:
df1.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [46]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [48]:
df1.count()

Pregnancies                 768
Glucose                     768
BloodPressure               768
SkinThickness               768
Insulin                     768
BMI                         768
DiabetesPedigreeFunction    768
Age                         768
Outcome                     768
dtype: int64

In [49]:
df1.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [50]:
df1.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [51]:
df1[df1['Glucose'] == 0]
len(df1[df1['Glucose'] == 0])

5

In [52]:
df1[df1['BloodPressure'] == 0]
len(df1[df1['BloodPressure'] == 0])

35

In [53]:
df1[df1['SkinThickness'] == 0]
len(df1[df1['SkinThickness'] == 0])

227

In [54]:
df1[df1['BMI'] == 0]
len(df1[df1['BMI'] == 0])

11

In [55]:
df1[df1['Insulin'] == 0]
len(df1[df1['Insulin'] == 0])

374

In [56]:
zero_columns = ['Glucose','BloodPressure', 'SkinThickness', 'BMI', 'Insulin']

In [57]:
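# Replace zero placeholders with the column mean.
# Note: the mean is computed over all rows, zeros included.
for column in zero_columns:
    mean = df1[column].mean(skipna = True)
    df1[column] = df1[column].replace(0, mean)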

In [58]:
df1.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [59]:
len(df1.columns)

9

In [60]:
X=df1[df1.columns[0:-1]]

Y=df1[df1.columns[-1]]

In [61]:
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.20)

In [62]:
tree = RandomForestClassifier()
tree

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [63]:
tree.fit(X_train,y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train,y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test,y_test)))

Accuracy on training set: 0.985
Accuracy on test set: 0.721




In [64]:
predictions = tree.predict(X_test)

In [65]:
confusion_matrix(y_test, predictions)

array([[92, 13],
       [30, 19]], dtype=int64)

In [66]:
accuracy_score(y_test, predictions)

0.7207792207792207

In [67]:
tree.predict([[1,2,3,4,5,6,7,8]])

array([0], dtype=int64)

In [68]:
# Get numerical feature importances
importances = list(tree.feature_importances_)
importances

[0.11237535986425154,
 0.25980811493089023,
 0.086271093402944,
 0.07570923446176994,
 0.07657961190248905,
 0.15757973959388066,
 0.11794506238578831,
 0.11373178345798632]

In [69]:
# Saving feature names (only the input columns; 'Outcome' is the target)
feature_list = list(X.columns)

In [70]:
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
feature_importances

[('Pregnancies', 0.11),
 ('Glucose', 0.26),
 ('BloodPressure', 0.09),
 ('SkinThickness', 0.08),
 ('Insulin', 0.08),
 ('BMI', 0.16),
 ('DiabetesPedigreeFunction', 0.12),
 ('Age', 0.11)]

In [71]:
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
feature_importances

[('Glucose', 0.26),
 ('BMI', 0.16),
 ('DiabetesPedigreeFunction', 0.12),
 ('Pregnancies', 0.11),
 ('Age', 0.11),
 ('BloodPressure', 0.09),
 ('SkinThickness', 0.08),
 ('Insulin', 0.08)]

In [72]:
# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Variable: Glucose              Importance: 0.26
Variable: BMI                  Importance: 0.16
Variable: DiabetesPedigreeFunction Importance: 0.12
Variable: Pregnancies          Importance: 0.11
Variable: Age                  Importance: 0.11
Variable: BloodPressure        Importance: 0.09
Variable: SkinThickness        Importance: 0.08
Variable: Insulin              Importance: 0.08


# Decision Tree Regressor

In [73]:
from sklearn.tree import DecisionTreeRegressor

In [12]:
df = pd.read_csv(r"/Users/shreyansengupta/Downloads/Py/1000_Companies.csv")
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [77]:
df['State'] = df['State'].astype('category')
df['State'] = df['State'].cat.codes
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,2,192261.83
1,162597.7,151377.59,443898.53,0,191792.06
2,153441.51,101145.55,407934.54,1,191050.39
3,144372.41,118671.85,383199.62,2,182901.99
4,142107.34,91391.77,366168.42,1,166187.94


In [78]:
x = df.iloc[:,:-1] # independent variable
y = df.iloc[:,4] # dependent variable

In [79]:
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size = 0.20)
len(x_train),len(x_test),len(y_train),len(y_test)

(800, 200, 800, 200)

In [80]:
dtr = DecisionTreeRegressor()
dtr.fit(x_train,y_train)
y_pred = dtr.predict(x_test)

In [81]:
y_pred

array([171176.9165 ,  89012.02672, 165530.0505 , 116260.5043 ,
       144807.5047 , 154901.8116 ,  67226.74247, 181946.9678 ,
       158338.6258 , 100589.3834 , 162440.0786 ,  73850.06347,
       168246.6971 , 179522.4889 , 173897.8345 , 169673.3637 ,
       133834.1321 , 172615.5432 , 163155.1205 ,  88592.56966,
        65743.69265, 128195.809  , 148158.0355 , 103378.6447 ,
        91640.68127,  58963.18204, 132897.8287 , 101971.6268 ,
        57992.70704, 168459.4156 , 128986.8828 , 105733.54   ,
        98444.25774, 179522.4889 ,  87045.44798, 175771.2955 ,
       161783.1286 , 101028.4891 , 180684.3252 ,  80082.11902,
       155915.0011 , 180783.423  , 165119.1364 , 176839.1597 ,
        63814.70273, 152243.2568 , 159247.5918 , 105503.2673 ,
       182098.1774 , 101971.6268 ,  76229.26494, 144184.7263 ,
        71376.03566, 184887.4387 , 157493.7316 ,  85830.64565,
        59766.21593, 126993.8211 ,  97955.60308, 174320.7087 ,
       162440.0786 , 172937.611  , 115980.2967 , 130689

In [82]:
y_test

261    170883.89450
407     88971.02073
624    165495.02460
280    116273.31870
696    144820.31910
390    154765.97920
83      67473.63267
595    181929.02770
700    158565.86730
529    100556.06600
276    162479.37600
483     73872.27504
665    168285.14020
106    178759.60670
576    174007.18380
210    169661.40360
519    133617.14210
632    172769.31560
778    163028.68540
721     88481.51178
503     65664.24355
212    128265.00660
874    148134.11530
650    103404.27340
938     91623.59544
38      81229.06000
93     132915.76890
107    101668.35340
219     58223.36571
722    168402.17810
           ...     
202    138872.74260
944    146690.36290
365     54244.93080
159    119743.45020
913     54932.63535
738    123690.27630
508    139555.32140
747    164884.20620
3      182901.99000
908    180378.48880
108    151782.79380
160    103155.67470
571    184581.60240
971    110395.79400
263    110023.32300
357    152920.70990
43      69758.98000
854    164695.40790
0      192261.83000


In [83]:
from sklearn.metrics import mean_squared_error, r2_score  # evaluation metrics
r2_score(y_test,y_pred)

0.9942883250712671
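
For reference, `r2_score` reports the coefficient of determination, 1 - SS_res / SS_tot. The sketch below computes it by hand with a hypothetical helper `r2`, using only the first four test targets and tree predictions printed above (rounded to two decimals), so its value differs from the full-test-set score.

```python
def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))   # unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1 - ss_res / ss_tot


# First four rows of y_test and y_pred from the cells above, rounded.
actual = [170883.89, 88971.02, 165495.02, 116273.32]
predicted = [171176.92, 89012.03, 165530.05, 116260.50]

score = r2(actual, predicted)
print(score)
```

A score of 1 means the predictions match the targets exactly, and 0 means the model does no better than always predicting the mean of the targets.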

In [84]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(x_train,y_train)
y_pred = rfr.predict(x_test)



In [85]:
y_pred

array([170988.03266 ,  88880.892994, 165495.70799 , 116297.49511 ,
       144865.4257  , 154922.74171 ,  67271.934482, 181970.20453 ,
       158352.97789 , 100527.8744  , 162551.81994 ,  73947.110968,
       168175.19289 , 178678.44902 , 173951.82574 , 169659.01159 ,
       133840.28298 , 172834.24177 , 163214.5792  ,  88615.122952,
        65710.460714, 128252.53396 , 148149.49259 , 103330.71896 ,
        91647.515598,  67619.59789 , 132866.81793 , 101704.06276 ,
        58094.453142, 168460.8679  , 128974.15386 , 110244.18473 ,
        98551.642169, 170348.56675 ,  87035.196484, 175778.81325 ,
       161693.42797 , 101059.58534 , 180704.14476 ,  79991.564134,
       155980.69611 , 180763.60344 , 165139.89567 , 176939.02637 ,
        63755.41491 , 152181.24746 , 159245.2852  , 105477.724   ,
       182134.65564 ,  99141.786635,  76346.473713, 144184.21374 ,
        71473.368514, 184901.53451 , 157347.64776 ,  85832.952237,
        59769.291378, 126921.7189  ,  97876.666557, 174347.533

In [86]:
y_test

261    170883.89450
407     88971.02073
624    165495.02460
280    116273.31870
696    144820.31910
390    154765.97920
83      67473.63267
595    181929.02770
700    158565.86730
529    100556.06600
276    162479.37600
483     73872.27504
665    168285.14020
106    178759.60670
576    174007.18380
210    169661.40360
519    133617.14210
632    172769.31560
778    163028.68540
721     88481.51178
503     65664.24355
212    128265.00660
874    148134.11530
650    103404.27340
938     91623.59544
38      81229.06000
93     132915.76890
107    101668.35340
219     58223.36571
722    168402.17810
           ...     
202    138872.74260
944    146690.36290
365     54244.93080
159    119743.45020
913     54932.63535
738    123690.27630
508    139555.32140
747    164884.20620
3      182901.99000
908    180378.48880
108    151782.79380
160    103155.67470
571    184581.60240
971    110395.79400
263    110023.32300
357    152920.70990
43      69758.98000
854    164695.40790
0      192261.83000


In [87]:
r2_score(y_test,y_pred)

0.9936566484967649