# Decision Trees
You should build a machine learning pipeline using a decision tree model. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Conduct data exploration, data preprocessing, and feature engineering if necessary. 
- Train and test a decision tree model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

#### Importing Libraries

In [None]:
import pandas as pd
import sklearn.model_selection
import sklearn.compose
import sklearn.preprocessing
import sklearn.tree
import sklearn.metrics
from sklearn import tree
import matplotlib.pyplot as plt

#### *Defining the Business Problem*

#### Data Collection

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/m-mahdavi/teaching/main/datasets/mnist.csv')
df.head()

In [None]:
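# Separate the features from the target label
x = df.drop(['id', 'class'], axis=1)
y = df['class']

# Hold out part of the data for testing (25% of the rows by default)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)

print("df size:", df.shape)
print("x_train size:", x_train.shape)
print("x_test size:", x_test.shape)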
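The split above is reshuffled on every run because no seed is fixed. A minimal sketch of a reproducible, class-balanced split follows. It uses scikit-learn's bundled `load_digits` data as a stand-in for `mnist.csv` (an assumption, so that it runs offline), and the `test_size` and `random_state` values are arbitrary choices for illustration.

```python
import sklearn.datasets
import sklearn.model_selection

# Stand-in data (assumption): the small 8x8 digits set shipped with scikit-learn.
x_demo, y_demo = sklearn.datasets.load_digits(return_X_y=True, as_frame=True)

x_tr, x_te, y_tr, y_te = sklearn.model_selection.train_test_split(
    x_demo, y_demo,
    test_size=0.2,      # hold out 20% of the rows for testing
    stratify=y_demo,    # keep class proportions similar in both splits
    random_state=42,    # fixed seed so the split is repeatable
)

print("train size:", x_tr.shape)
print("test size:", x_te.shape)
```

The same three keyword arguments can be passed to the call on `x` and `y` in the cell above.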

#### Data Exploration

In [None]:
x_train.dtypes

In [None]:
y_train.value_counts().plot(kind='bar')

In [None]:
x_train.isnull().sum()

In [None]:
x_train.sum()

#### Data Pre-processing

In [None]:
# For example, you can find and remove constant features from your data.
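One way to remove constant features is sketched below on a tiny, hypothetical frame (`toy_train` and `toy_test` stand in for `x_train` and `x_test`; the column names are made up). Constant columns are identified on the training split only and then dropped from both splits.

```python
import pandas as pd

# Hypothetical toy frames standing in for x_train / x_test.
toy_train = pd.DataFrame({"pixel1": [0, 0, 0], "pixel2": [0, 5, 9], "pixel3": [7, 7, 7]})
toy_test = pd.DataFrame({"pixel1": [0, 1], "pixel2": [3, 4], "pixel3": [7, 7]})

# A column is constant if it takes a single distinct value in the training set.
constant_columns = [c for c in toy_train.columns if toy_train[c].nunique() <= 1]

# Drop the same columns from both splits so their shapes stay aligned.
toy_train = toy_train.drop(columns=constant_columns)
toy_test = toy_test.drop(columns=constant_columns)

print("removed:", constant_columns)
print("remaining:", toy_train.columns.tolist())
```

In image data such as handwritten digits, border pixels that are always blank are typical candidates for this step.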

#### Feature Engineering

In [None]:
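# Standardize every integer-valued column; any other columns pass through unchanged
numerical_attributes = x_train.select_dtypes(include="int64").columns.tolist()

ct = sklearn.compose.ColumnTransformer(
    [('standard_scaling', sklearn.preprocessing.StandardScaler(), numerical_attributes)],
    remainder='passthrough')
# Fit on the training set only, then apply the same transformation to both splits
ct.fit(x_train)
x_train = ct.transform(x_train)
x_test = ct.transform(x_test)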
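Decision trees split on per-feature thresholds, so standardizing the inputs is generally not expected to change the fitted tree much. The rough check below compares a tree trained on raw features with one trained on standardized features. It again uses the bundled `load_digits` data as a stand-in (an assumption), and the result on `mnist.csv` may differ.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.tree

# Stand-in data (assumption): the digits set shipped with scikit-learn.
x_demo, y_demo = sklearn.datasets.load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = sklearn.model_selection.train_test_split(
    x_demo, y_demo, random_state=0)

# Fit the scaler on the training split only.
scaler = sklearn.preprocessing.StandardScaler().fit(x_tr)

# Same hyperparameters and seed for both trees; only the inputs differ.
raw_tree = sklearn.tree.DecisionTreeClassifier(max_depth=10, random_state=0)
raw_tree.fit(x_tr, y_tr)
scaled_tree = sklearn.tree.DecisionTreeClassifier(max_depth=10, random_state=0)
scaled_tree.fit(scaler.transform(x_tr), y_tr)

raw_accuracy = raw_tree.score(x_te, y_te)
scaled_accuracy = scaled_tree.score(scaler.transform(x_te), y_te)
print("raw accuracy:", raw_accuracy)
print("scaled accuracy:", scaled_accuracy)
```

Scaling still matters for many other models, so keeping the step makes it easier to swap in a different estimator later.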

#### Model Training

In [None]:
model = sklearn.tree.DecisionTreeClassifier(criterion='gini',max_depth=10)
model.fit(x_train,y_train)
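The sketch below touches a few hyperparameters, attributes, and methods of `DecisionTreeClassifier` that are often worth knowing. It is trained on the bundled `load_digits` data as a stand-in (an assumption), and the hyperparameter values are illustrative, not tuned.

```python
import sklearn.datasets
import sklearn.tree

# Stand-in data (assumption): the digits set shipped with scikit-learn.
x_demo, y_demo = sklearn.datasets.load_digits(return_X_y=True)

demo_model = sklearn.tree.DecisionTreeClassifier(
    criterion="gini",      # split quality measure ("entropy" is another option)
    max_depth=10,          # upper bound on tree depth, limits overfitting
    min_samples_leaf=5,    # every leaf must contain at least 5 training samples
    random_state=0,        # fixed seed for repeatable tie-breaking
)
demo_model.fit(x_demo, y_demo)

# Methods describing the fitted tree.
print("depth:", demo_model.get_depth())
print("leaves:", demo_model.get_n_leaves())

# Attributes learned during fitting.
print("classes:", demo_model.classes_)
importances = demo_model.feature_importances_
top_features = importances.argsort()[::-1][:5]   # indices of the 5 most important features
print("top features:", top_features)

# Class probabilities for the first sample.
probabilities = demo_model.predict_proba(x_demo[:1])
print(probabilities.round(2))
```

The same calls work on `model` from the cell above once it has been fitted.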

#### Model Assessment

In [None]:
y_predicted = model.predict(x_test)
result = sklearn.metrics.classification_report(y_test,y_predicted)
print(result)
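Besides the classification report, overall accuracy and a confusion matrix show which digits get mistaken for which. A minimal sketch follows, again on the bundled `load_digits` data as a stand-in (an assumption). The variable names are illustrative.

```python
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import sklearn.tree

# Stand-in data (assumption): the digits set shipped with scikit-learn.
x_demo, y_demo = sklearn.datasets.load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = sklearn.model_selection.train_test_split(
    x_demo, y_demo, stratify=y_demo, random_state=0)

demo_model = sklearn.tree.DecisionTreeClassifier(max_depth=10, random_state=0)
demo_model.fit(x_tr, y_tr)
y_pred = demo_model.predict(x_te)

# Fraction of test samples classified correctly.
accuracy = sklearn.metrics.accuracy_score(y_te, y_pred)
print("accuracy:", accuracy)

# Rows are true classes, columns are predicted classes.
matrix = sklearn.metrics.confusion_matrix(y_te, y_pred)
print(matrix)
```

Large off-diagonal entries point to pairs of classes the tree tends to confuse.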

#### Visualization

In [None]:
# Using matplotlib: draw only the top levels of the tree to keep the plot readable
fig = plt.figure(figsize=(25, 20))
tree.plot_tree(model, filled=True, rounded=True, max_depth=1)
plt.show()

In [None]:
# Using graphviz (requires the third-party `graphviz` and `pydotplus` packages)
import graphviz
import pydotplus

dot_data = sklearn.tree.export_graphviz(model, out_file=None,
                     class_names=[str(c) for c in model.classes_],
                     filled=True, rounded=True, max_depth=4,
                     special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graphviz.Source(graph.to_string())
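If Graphviz is not available, `sklearn.tree.export_text` gives a plain-text view of the same rules with no extra dependencies. The sketch below uses the bundled `load_digits` data as a stand-in (an assumption) and prints only the first levels of the tree.

```python
import sklearn.datasets
import sklearn.tree

# Stand-in data (assumption): the digits set shipped with scikit-learn.
x_demo, y_demo = sklearn.datasets.load_digits(return_X_y=True)
demo_model = sklearn.tree.DecisionTreeClassifier(max_depth=10, random_state=0)
demo_model.fit(x_demo, y_demo)

# Render the top of the tree as indented if/else rules; deeper branches are truncated.
rules = sklearn.tree.export_text(demo_model, max_depth=2)
print(rules)
```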