![iut](stock_image/IUT200.png)
<hr style="margin-bottom: 40px;">


# STEM Jobs Salaries - Decision Tree

## Decision Tree

#### Decision trees can be used for both classification and regression. Classification categorizes data into predefined classes based on their features or attributes; regression, the task in this notebook, predicts a continuous value (here, `BaseSalary`). Both are supervised learning techniques: the goal is to train a model on labeled training data so that it makes accurate predictions on unseen test data.

[Link to the Data used in this Notebook](https://drive.google.com/file/d/1IhXv0qcq7YFfBxc0BQB1-z74wF40ZnZn/view?usp=share_link)


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Decision Tree - Importing Modules



In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Loading The Initial Data:


In [None]:
!head data/STEMJobs.csv
# Note: if you run this line locally it may fail with an error,
# as this notebook is meant to be executed in Google Colab

In [None]:
Data = pd.read_csv(
    'Data/STEM.csv',
    parse_dates=['Timestamp'])

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Data First Look:

In [None]:
Data.head()

In [None]:
Data.info()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Data Types


In [None]:
Data.dtypes

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Data Preparation


In [None]:
Data.head()

In [None]:
new_data = Data

# Drop the timestamp and the indicator columns that duplicate other features
new_data = new_data.drop(['Timestamp', 'IsUS', 'IsCA', 'IsID', 'IsIN', 'IsDE',
                          'Tag', 'IsMale', 'IsFemale', 'Masters_Degree', 'Bachelors_Degree'], axis=1)

In [None]:
new_data.head()

In [None]:
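# Restrict to numeric columns; recent pandas versions raise on string columns otherwise
new_data.corr(numeric_only=True)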

In [None]:
x = new_data
x = x.drop(['TotalYearlyCompensation', 'BaseSalary'], axis=1)

In [None]:
y = new_data.BaseSalary

In [None]:
x

In [None]:
print(x.Gender.value_counts())

In [None]:
len(x.Education.unique())

In [None]:
len(x.Country.unique())

In [None]:
len(x.Title.unique())

In [None]:
len(x.Level.unique())

In [None]:
len(x.Company.unique())

In [None]:
x = x.drop(['Level', 'Company'], axis=1)

In [None]:
pd.get_dummies(x)

In [None]:
x = pd.get_dummies(x)
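
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does to a mixed frame. The toy values and the `YearsOfExperience` column are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame: values (and the numeric column name) are illustrative only
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Education": ["Bachelors", "Masters", "PhD"],
    "YearsOfExperience": [2, 5, 10],
})

# String columns are expanded into one indicator column per category;
# numeric columns pass through unchanged
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each distinct category becomes its own column, which is why high-cardinality columns such as `Level` and `Company` were dropped before encoding.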

In [None]:
x

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Fixed seed so the split, and therefore the scores below, are reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=0)

In [None]:
x_train

In [None]:
y_train

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Sklearn Module - Decision Tree Regressor


In [None]:
import sklearn

from sklearn.tree import DecisionTreeRegressor

In [None]:
# List of criterion methods
criterion_methods = ['squared_error', 'friedman_mse', 'poisson']

# Iterate over each criterion method
for method in criterion_methods:
    # Create the DecisionTreeRegressor with the current criterion method
    DT = DecisionTreeRegressor(criterion=method, max_depth=9, min_samples_split=2, random_state=0)
    
    # Fit the model on the training data
    DT.fit(x_train, y_train)
    
    # Calculate the score on the test data
    score = DT.score(x_test, y_test)
    
    # Print the score for the current criterion method
    print(f"Criterion: {method}")
    print(f"Score: {score}\n")
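
For a regressor, `score` returns the coefficient of determination (R²), not an accuracy. The sketch below checks this on synthetic data generated with `make_regression`; the data and variable names are assumptions for illustration, not the salary dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the STEM salaries)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three values should agree
print(r2_manual, model.score(X_te, y_te), r2_score(y_te, pred))
```

A score of 1.0 means perfect predictions, 0.0 means no better than always predicting the mean, and negative values are possible for models that do worse than that.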


In [None]:
# List of criterion methods
criterion_methods = ['squared_error', 'friedman_mse', 'poisson']
max_depths = range(9, 21)
scores = []

# Iterate over each criterion method
for method in criterion_methods:
    criterion_scores = []
    
    # Iterate over each max_depth value
    for max_depth in max_depths:
        # Create the DecisionTreeRegressor with the current criterion method and max_depth
        DT = DecisionTreeRegressor(criterion=method, max_depth=max_depth, min_samples_split=2, random_state=0)
        
        # Fit the model on the training data
        DT.fit(x_train, y_train)
        
        # Calculate the score on the test data
        score = DT.score(x_test, y_test)
        criterion_scores.append(score)
    
    # Print the scores for the current criterion method
    print(f"Criterion: {method}")
    print("Scores:", criterion_scores)
    print()
    
    # Store the scores for the current criterion method
    scores.append(criterion_scores)

# Plot the scores for each criterion method
for i, method in enumerate(criterion_methods):
    plt.plot(max_depths, scores[i], label=method)

plt.xlabel('max_depth')
plt.ylabel('Score')
plt.title('DecisionTreeRegressor Scores for Different Criteria')
plt.legend()
plt.show()
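
The sweep above picks `max_depth` by looking at test-set scores, which can make the final test score optimistic. One common alternative is to choose the depth by cross-validation on the training data only. The following is a sketch of that idea on synthetic data, not the method used in this notebook; the depth grid and fold count are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the STEM salaries)
X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# 5-fold cross-validation over an illustrative grid of depths
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(2, 11))},
    cv=5,
    scoring="r2",
)
search.fit(X_tr, y_tr)

best_depth = search.best_params_["max_depth"]
# The test set is touched only once, after the depth has been chosen
test_r2 = search.score(X_te, y_te)

print("Best max_depth:", best_depth)
print("Test R^2:", test_r2)
```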

In [None]:
DT = DecisionTreeRegressor(criterion='poisson', max_depth=5, min_samples_split=2, random_state=0)

In [None]:
DT.fit(x_train, y_train)

In [None]:
DT.score(x_test, y_test)

In [None]:
# Pair each feature name with its importance and sort from most to least important
importance_scores = zip(x.columns, DT.feature_importances_)
sorted_importances = sorted(importance_scores, key=lambda pair: pair[1], reverse=True)

for feature, importance in sorted_importances:
    print(f"{feature}: {importance}")
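
The impurity-based importances reported by a fitted tree are normalized, so they sum to 1 whenever the tree has at least one split. A small sketch on synthetic data, with hypothetical feature names `f0` to `f3`, shows this and a compact way to sort them with pandas.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; feature names are hypothetical
X, y = make_regression(n_samples=200, n_features=4, n_informative=2,
                       noise=5.0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# A Series keeps names and values together and sorts in one call
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
print("Sum of importances:", importances.sum())
```

Because one-hot encoding splits a categorical column into many indicators, the importance of the original column is spread across those indicator columns.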

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Sklearn Module - Random Forest Regressor


In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
forest = RandomForestRegressor(n_estimators=500, max_depth=21, max_features="sqrt", bootstrap=False, n_jobs=-1)

In [None]:
forest.fit(x_train, y_train)

In [None]:
forest.score(x_test, y_test)

In [None]:
n_estimators = 500
max_depths = range(10, 35)
max_features = "sqrt"
bootstrap = False
n_jobs = -1

scores = []

for max_depth in max_depths:
    forest = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth,
                                   max_features=max_features, bootstrap=bootstrap, n_jobs=n_jobs)

    forest.fit(x_train, y_train)
    score = forest.score(x_test, y_test)
    scores.append(score)

plt.plot(max_depths, scores)
plt.xlabel('max_depth')
plt.ylabel('Score')
plt.title('RandomForestRegressor Scores for Different max_depth')
plt.show()

In [None]:
# Importances of the last forest fitted in the loop above
importance_scores = zip(x.columns, forest.feature_importances_)
sorted_importances = sorted(importance_scores, key=lambda pair: pair[1], reverse=True)

for feature, importance in sorted_importances:
    print(f"{feature}: {importance}")

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Plotting The Trees


In [None]:
from sklearn import tree

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## DT Tree


In [None]:
fig = plt.figure(figsize=(100, 100))
tree.plot_tree(DT, filled=True)
fig.savefig("DT Tree.png")

In [None]:
from sklearn.tree import export_text

tree_text = export_text(DT)

print(tree_text)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Forest Tree


In [None]:
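# Visualize the first tree of the fitted forest
tree_to_visualize = forest.estimators_[0]

# Keep a handle on the figure so it can be saved in the next cell
fig = plt.figure(figsize=(10, 10))
tree.plot_tree(tree_to_visualize, filled=True)
plt.show()

In [None]:
# Save the forest tree under its own name so "DT Tree.png" is not overwritten
# (the file name here is a suggestion)
fig.savefig("Forest Tree.png")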

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score

# Create an instance of AdaBoostRegressor
ada_boost = AdaBoostRegressor()

# Fit the model on the training data
ada_boost.fit(x_train, y_train)

# Make predictions on the test data
y_pred = ada_boost.predict(x_test)

# Calculate the score using R-squared metric
score = r2_score(y_test, y_pred)

# Print the score
print("Score:", score)


In [None]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Instantiate the BaggingRegressor with specified parameters
bagging = BaggingRegressor(
                           n_estimators=50,
                           max_samples=0.1,
                           max_features=0.9,
                           bootstrap=False,
                           random_state=0)

# Fit the model on the training data
bagging.fit(x_train, y_train)

# Calculate the score on the test data
score = bagging.score(x_test, y_test)

# Print the score
print("BaggingRegressor Score:", score)



In [None]:
from sklearn.ensemble import ExtraTreesRegressor

# Instantiate the ExtraTreesRegressor with specified parameters
extra_trees = ExtraTreesRegressor(n_estimators=50,
                                  criterion='squared_error',
                                  max_depth=None,
                                  min_samples_split=2,
                                  min_samples_leaf=1,
                                  max_features=1.0,  # 'auto' was removed in newer scikit-learn; 1.0 uses all features
                                  bootstrap=False,
                                  random_state=0)

# Fit the model on the training data
extra_trees.fit(x_train, y_train)

# Calculate the score on the test data
score = extra_trees.score(x_test, y_test)

# Print the score
print("ExtraTreesRegressor Score:", score)


In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate the GradientBoostingRegressor with specified parameters
gradient_boosting = GradientBoostingRegressor(loss='squared_error',
                                              learning_rate=0.1,
                                              n_estimators=200,
                                              subsample=0.8,
                                              criterion='squared_error',
                                              min_samples_split=2,
                                              min_samples_leaf=1,
                                              max_depth=5,
                                              max_features=None,
                                              random_state=0)

# Fit the model on the training data
gradient_boosting.fit(x_train, y_train)

# Calculate the score on the test data
score = gradient_boosting.score(x_test, y_test)

# Print the score
print("GradientBoostingRegressor Score:", score)

In [None]:
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Instantiate the base estimators
estimators = [
    ('linear', LinearRegression()),
    ('tree', DecisionTreeRegressor()),
    ('knn', KNeighborsRegressor())
]

# Instantiate the StackingRegressor with the base estimators and meta-estimator
stacking_reg = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())

# Fit the model on the training data
stacking_reg.fit(x_train, y_train)

# Predict using the StackingRegressor
y_pred = stacking_reg.predict(x_test)

# Evaluate the model's performance
score = stacking_reg.score(x_test, y_test)
print("StackingRegressor Score:", score)


In [None]:
from sklearn.ensemble import HistGradientBoostingRegressor

# Instantiate the HistGradientBoostingRegressor
hist_gb_reg = HistGradientBoostingRegressor()

# Fit the model on the training data
hist_gb_reg.fit(x_train, y_train)

# Predict using the HistGradientBoostingRegressor
y_pred = hist_gb_reg.predict(x_test)

# Evaluate the model's performance
score = hist_gb_reg.score(x_test, y_test)
print("HistGradientBoostingRegressor Score:", score)
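
Since every model above is scored the same way, the comparison can be collected in one place. This is a sketch on synthetic data with an arbitrary subset of the estimators; the dictionary keys and hyperparameters are illustrative, not the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the STEM salaries)
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# Illustrative subset of models with arbitrary settings
models = {
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model and record its test-set R^2, best first
results = pd.Series(
    {name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in models.items()}
).sort_values(ascending=False)

print(results)
```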



![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
