Pixeltests School Data Science

*Unit 2, Sprint 2, Module 1*

---

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/pixeltests/datasets/main/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Module Project: Decision Trees

This week, the module projects will focus on creating and improving a model for the Tanzania Water Pump dataset. Your goal is to create a model to predict whether a water pump is functional, non-functional, or needs repair.

Dataset source: [DrivenData.org](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

## Directions

The tasks for this project are as follows:

- **Task 1:** Enter the [Kaggle](https://www.kaggle.com/t/6169ee7701164d24943c98eda2de9b5e) competition using exactly this link!
- **Task 2:** Use the `wrangle` function to import training and test data.
- **Task 3:** Split training data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and validation sets.
- **Task 5:** Establish the baseline accuracy score for your dataset.
- **Task 6:** Build and train `model_dt`.
- **Task 7:** Calculate the training and validation accuracy scores for your model.
- **Task 8:** Adjust the model's `max_depth` to reduce overfitting.
- **Task 9 `stretch goal`:** Create a horizontal bar chart showing the 10 most important features for your model.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

# Kaggle

**Task 1:** Enter the [Kaggle](https://www.kaggle.com/t/6169ee7701164d24943c98eda2de9b5e) competition using exactly this link! **We recommend that you choose a username based on your name, since you might include it in your resume in the future.** Go to the **Rules** page and accept the rules of the competition. Notice that the **Rules** page also has instructions for the submission process. The **Data** page has feature definitions.

# I. Wrangle Data

In [2]:
def wrangle(fm_path, tv_path=None):
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path,
                                  na_values=[0, -2.000000e-08]),
                      pd.read_csv(tv_path)).set_index('id')
    else:
        df = pd.read_csv(fm_path,
                         na_values=[0, -2.000000e-08],
                         index_col='id')

    # Drop constant columns
    df.drop(columns=['recorded_by'], inplace=True)

    # Drop HCCCs
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Drop duplicate columns
    dupe_cols = [col for col in df.head(15).T.duplicated().index
                 if df.head(15).T.duplicated()[col]]
    df.drop(columns=dupe_cols, inplace=True)

    return df
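
The two column filters inside `wrangle` can be checked on a small example. The sketch below is an illustration only: the toy values are made up, the cutoff is lowered from 100 to 2 so the effect is visible on three rows, and non-numeric columns are selected with `exclude='number'` (an assumption made here so the example does not depend on how pandas stores strings).

```python
import pandas as pd

# Toy data (hypothetical values): 'quantity_group' repeats 'quantity',
# and 'wpt_name' has a different value on every row (high cardinality).
toy = pd.DataFrame({
    "quantity": ["dry", "enough", "dry"],
    "quantity_group": ["dry", "enough", "dry"],
    "wpt_name": ["A", "B", "C"],
    "gps_height": [10, 20, 30],
})

# High-cardinality categorical columns: more unique values than the cutoff.
cutoff = 2  # the notebook uses 100; 2 keeps the toy example small
high_card_cols = [col for col in toy.select_dtypes(exclude="number").columns
                  if toy[col].nunique() > cutoff]

# Duplicate columns: transpose so columns become rows, then flag rows
# that repeat an earlier row. Only the first 15 rows are compared.
is_dupe = toy.head(15).T.duplicated()
dupe_cols = is_dupe[is_dupe].index.tolist()

print(high_card_cols)  # ['wpt_name']
print(dupe_cols)       # ['quantity_group']
```

Computing the boolean mask once and indexing it with itself gives the same list as the comprehension in `wrangle`, without transposing the frame twice.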

**Task 2:** Using the `wrangle` function above, read the `train_features.csv` and `train_labels.csv` files into the DataFrame `df`. Next, use the same function to read the test set `test_features.csv` into the DataFrame `X_test`.

In [3]:
import pandas as pd
#df = wrangle(DATA_PATH+'waterpumps/train_features.csv', DATA_PATH+'waterpumps/train_labels.csv')
#X_test = wrangle(DATA_PATH+'waterpumps/test_features.csv')

#X_test.shape
from google.colab import files
uploaded = files.upload()

Saving tanzania sample solution.csv to tanzania sample solution.csv
Saving test_features.csv to test_features.csv
Saving train_features.csv to train_features.csv
Saving train_labels.csv to train_labels.csv


In [4]:
def preprocess(df):

  df.drop('id',axis = 1,inplace = True)
  #Dropping the recorded by column

  df.drop('recorded_by',axis = 1,inplace = True)

  #Engineering 'pump-age'

  #df['pump_age'] = 2023 - df['construction_year']

  #Dropping the HCC's

  cutoff = 100

  drop_cols = [col for col in df.select_dtypes('object').columns
             if df[col].nunique() > cutoff]
  df.drop(drop_cols , inplace = True , axis = 1)

  #Dropping the duplicate columns

  dupe_cols = [col for col in df.head(15).T.duplicated().index
             if df.head(15).T.duplicated()[col]]
  df.drop(dupe_cols , inplace = True , axis = 1)

  return df

In [14]:
import io

X_test1 = pd.read_csv(io.BytesIO(uploaded['test_features.csv']))
X_test = preprocess(X_test1)

df1 = pd.read_csv(io.BytesIO(uploaded['train_features.csv']))
df2 = pd.read_csv(io.BytesIO(uploaded['train_labels.csv']))
df = pd.concat([df1, df2], axis=1)

df = preprocess(df)
X_test.head()
print(df.shape, X_test.shape)

(47520, 30) (11880, 29)


# II. Split Data

**Task 3:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

In [8]:
from sklearn.model_selection import train_test_split

# The status_group column is the target
X = df.drop('status_group', axis=1)
y = df['status_group']
print(X.shape, y.shape)

(47520, 29) (47520,)


**Task 4:** Using a randomized split, divide `X` and `y` into a training set (`X_train`, `y_train`) and a validation set (`X_val`, `y_val`).

In [13]:
# Split X and y into training (80%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(38016, 29) (9504, 29) (38016,) (9504,)
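
The split above draws a different random partition on every run, so the accuracy figures later in the notebook can shift slightly between runs. A minimal sketch of a reproducible, stratified split on toy data follows. The `random_state=42` seed and the `stratify` argument are assumptions added for illustration and are not part of the original cell.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 50 rows with an imbalanced 3-class target.
X_toy = pd.DataFrame({"gps_height": range(50)})
y_toy = pd.Series(
    ["functional"] * 25 + ["non functional"] * 15
    + ["functional needs repair"] * 10,
    name="status_group",
)

# random_state fixes the shuffle; stratify keeps class shares equal
# in the training and validation sets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_va.shape)
print(y_va.value_counts().to_dict())
```

With a fixed seed, rerunning the cell yields the same rows in each set, which makes depth comparisons easier to interpret.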


# III. Establish Baseline

**Task 5:** Since this is a **classification** problem, you should establish a baseline accuracy score. Determine the majority class in `y_train` and the percentage of your training observations it represents.

In [None]:
baseline_acc = ...
print('Baseline Accuracy Score:', baseline_acc)

Baseline Accuracy Score: Ellipsis
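
The cell above still holds the `...` placeholder, so it prints `Ellipsis` rather than a score. One way to compute a majority-class baseline is sketched below on a toy target vector; the values are made up, and `y_train_toy` stands in for the notebook's `y_train`.

```python
import pandas as pd

# Toy target vector (hypothetical values) standing in for y_train.
y_train_toy = pd.Series(
    ["functional"] * 6
    + ["non functional"] * 3
    + ["functional needs repair"] * 1
)

# Share of each class; the largest share is the accuracy of a model
# that always predicts the majority class.
class_shares = y_train_toy.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_acc = class_shares.max()

print("Majority class:", majority_class)         # functional
print("Baseline Accuracy Score:", baseline_acc)  # 0.6
```

Applied to the real data, the same two lines on `y_train` would give the figure that `model_dt` has to beat.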


# IV. Build Model

**Task 6:** Build a `Pipeline` named `model_dt`, and fit it to your training data. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `DecisionTreeClassifier` predictor.

**Note:** Don't forget to set the `random_state` parameter for your `DecisionTreeClassifier`.
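
The three-step shape described above (ordinal encoding, imputation, tree) can be sketched on toy data as follows. This is an illustration under stated assumptions: it uses scikit-learn's own `OrdinalEncoder` inside a column transformer as a stand-in for the `category_encoders` version, and the toy columns, `max_depth=2`, and `random_state=42` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one categorical and one numeric feature,
# with a missing numeric value to exercise the imputer.
X_toy = pd.DataFrame({
    "quantity": ["dry", "enough", "dry", "enough", "enough", "dry"],
    "gps_height": [10.0, 250.0, np.nan, 300.0, 280.0, 15.0],
})
y_toy = pd.Series(["non functional", "functional", "non functional",
                   "functional", "functional", "non functional"])

# Encode non-numeric columns as integers; unseen categories map to -1.
encoder = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     make_column_selector(dtype_exclude=np.number)),
    remainder="passthrough",
)

model_sketch = make_pipeline(
    encoder,
    SimpleImputer(strategy="mean"),
    DecisionTreeClassifier(max_depth=2, random_state=42),
)
model_sketch.fit(X_toy, y_toy)

print("Training accuracy:", model_sketch.score(X_toy, y_toy))
```

Ordinal encoding keeps one column per categorical feature, which tends to suit trees; one-hot encoding, as used in the cell below, creates one column per category instead.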

In [15]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#set up a pipeline
model_dt = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(max_depth = 5, random_state=91)
    )
#fit on train
model_dt.fit(X_train, y_train)

#predict on test
y_pred= model_dt.predict(X_test)
y_pred

data = {
    "S.No.": range(1, 1 + len(X_test)),
    "status_group": y_pred
}
submission = pd.DataFrame(data)
#encode status_group
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

submission['status_group'] = le.fit_transform(submission['status_group'] )
submission['status_group']
#submission = pd.DataFrame(y_pred, index=ids)
submission.to_csv('submission-07.csv', index=False)

# V. Check Metrics

**Task 7:** Calculate the training and validation accuracy scores for `model_dt`.

In [16]:
training_acc = model_dt.score(X_train, y_train)
val_acc = model_dt.score(X_val, y_val)
#score on train, val

print('Training Accuracy Score:', training_acc)
print('Validation Accuracy Score:', val_acc)

Training Accuracy Score: 0.7144623316498316
Validation Accuracy Score: 0.7104377104377104


# VI. Tune Model

**Task 8:** Is there a large difference between your training and validation accuracy? If so, experiment with different settings for `max_depth` in your `DecisionTreeClassifier` to reduce the amount of overfitting in your model.

In [18]:
# Use this cell to experiment and then change
# your model hyperparameters in Task 6

model_dt = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(random_state=91, max_depth =3)
    )
#fit on train
model_dt.fit(X_train, y_train)

training_acc = model_dt.score(X_train, y_train)
val_acc = model_dt.score(X_val, y_val)
#score on train, val
print('max_depth = 3')
print('Training Accuracy Score:', training_acc)
print('Validation Accuracy Score:', val_acc)

model_dt = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(random_state=91, max_depth =5)
    )
#fit on train
model_dt.fit(X_train, y_train)

training_acc = model_dt.score(X_train, y_train)
val_acc = model_dt.score(X_val, y_val)
#score on train, val
print('max_depth = 5')
print('Training Accuracy Score:', training_acc)
print('Validation Accuracy Score:', val_acc)

max_depth = 3
Training Accuracy Score: 0.6946548821548821
Validation Accuracy Score: 0.6912878787878788
max_depth = 5
Training Accuracy Score: 0.7144623316498316
Validation Accuracy Score: 0.7104377104377104
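
Rather than copying the pipeline once per depth, the comparison can be written as a loop, which also keeps each printed label tied to the depth actually used. The sketch below runs on synthetic data from `make_classification`; the data, the depth range, and the seeds are assumptions for illustration, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (hypothetical stand-in for the pump features).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Fit one tree per depth and record (training, validation) accuracy.
depths = range(1, 11)
scores = {}
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))
    print(f"max_depth = {depth}: train {scores[depth][0]:.3f}, "
          f"val {scores[depth][1]:.3f}")

# Depth with the highest validation accuracy (first one on ties).
best_depth = max(scores, key=lambda d: scores[d][1])
print("Best max_depth by validation accuracy:", best_depth)
```

A growing gap between the training and validation columns as depth increases is the overfitting signal the task asks about.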


# VII. Communicate Results

**Task 9 `stretch goal`:** Create a horizontal bar chart that shows the 10 most important features for `model_dt`, sorted by value.

**Note:** [`DecisionTreeClassifier.feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreecla#sklearn.tree.DecisionTreeClassifier.feature_importances_) returns values that are different from [`LogisticRegression.coef_`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). All the values will be positive, and they will sum to `1`.
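
A minimal sketch of such a chart is given below. It uses synthetic data and a two-step pipeline, so the feature names come straight from the DataFrame columns; with an encoder in the pipeline, as in `model_dt`, the names would have to be the column names after encoding. The data, the `feature_i` names, and the seeds are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with named columns (hypothetical stand-in for X_train).
X_arr, y_syn = make_classification(
    n_samples=300, n_features=15, n_informative=5, random_state=0
)
X_syn = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(15)])

model = make_pipeline(
    SimpleImputer(strategy="mean"),
    DecisionTreeClassifier(max_depth=5, random_state=0),
)
model.fit(X_syn, y_syn)

# make_pipeline names each step after its lowercased class name.
tree = model.named_steps["decisiontreeclassifier"]
importances = pd.Series(tree.feature_importances_, index=X_syn.columns)

# Sort ascending and keep the last 10 so the largest bar is drawn on top.
top10 = importances.sort_values().tail(10)

fig, ax = plt.subplots()
top10.plot(kind="barh", ax=ax)
ax.set_xlabel("Gini importance")
ax.set_title("Top 10 feature importances")
fig.tight_layout()
# In a notebook the figure renders inline; call plt.show() in a script.
```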

**Kaggle submission:** `submission-07 (1).csv`, completed after the deadline. Private score: 0.70344. Public score: 0.70344.

