# Usage

## Example Usage

Here's an example of how to use `group17pkg` on a hyperthyroid dataset.

### Imports

In [2]:
import group17pkg as grp
import numpy as np
import pandas as pd

### Creating the Data
For this example, we'll use a [Thyroid Disease dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/) from the UCI Machine Learning Repository.

In [8]:
dfLink = 'http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/allhyper.data'
columnNames = ["age", "sex", "on thyroxine", "query on thyroxine",
               "on antithyroid medication", "sick", "pregnant",
               "thyroid surgery", "I131 treatment", "query hypothyroid",
               "query hyperthyroid", "lithium", "goitre", "tumor",
               "hypopituitary", "psych", "TSH measured", "TSH", "T3 measured",
               "T3", "TT4 measured", "TT4", "T4U measured", "T4U",
               "FTI measured", "FTI", "TBG measured", "TBG", "referral source",
               "binaryClass"]
hyperthyroid_df = pd.read_csv(dfLink, names=columnNames)

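As a small aside on the loading step, the sketch below shows how `names=` attaches column names to a headerless file. It also uses `na_values="?"`, an option the example above does not use, to mark `?` entries as missing while reading. The two rows and the four-column `toy_columns` list are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Two made-up rows in a headerless, comma-separated layout
# (illustrative values only, not real records from the dataset).
raw = io.StringIO("41,F,1.3,125\n23,M,?,102\n")

# Hypothetical short column list for this toy sample.
toy_columns = ["age", "sex", "TSH", "TT4"]

# names= supplies the header; na_values="?" converts "?" to NaN on load.
toy_df = pd.read_csv(raw, names=toy_columns, na_values="?")

# Count missing values per column.
print(toy_df.isna().sum())
```

Marking `?` as missing at read time is one alternative to a separate `replace("?", np.nan)` call after loading; either approach should leave the same missing values in place.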

### Cleaning Data
First, we'll use `relabel_bclass` on the dataframe to relabel the target column. In the hyperthyroid dataset the labels are reversed: "P" refers to a negative diagnosis and "N" refers to a positive diagnosis. This function swaps them back.
Next, we will use `col_dtype_reformat` to convert the specified columns to numeric or categorical datatypes.

In [3]:
hyperthyroid_df = grp.relabel_bclass(hyperthyroid_df)

# Replace "?" placeholders with NaN
hyperthyroid = hyperthyroid_df.replace("?", np.nan)
# Drop columns with no data
hyper = hyperthyroid.drop(columns=["TBG", "TBG measured", "T3", "T3 measured", "TSH measured",
                                   "TT4 measured", "FTI measured", "T4U measured", "referral source"])
# Drop rows that still contain missing values
hyper_clean = hyper.dropna()
# Change the dtype of the columns to numeric/categorical
num_cols = ['age', 'TSH', 'TT4', 'T4U', 'FTI']
cat_cols = ['sex', 'on thyroxine', 'query on thyroxine', 'on antithyroid medication',
            'sick', 'pregnant', 'thyroid surgery', 'I131 treatment', 'query hypothyroid',
            'query hyperthyroid', 'lithium', 'goitre', 'tumor', 'psych', 'binaryClass', 'hypopituitary']

hyper_clean = grp.col_dtype_reformat(num_cols, cat_cols, hyper_clean)
# Change binaryClass column so 0 represents negative and 1 represents positive
hyper_clean['binaryClass'] = hyper_clean['binaryClass'].replace(["N", "P"], [1, 0])

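For readers without the package installed, the two cleaning steps can be approximated in plain pandas. This is only a sketch under assumptions: it is not the package's implementation, the idea that relabelling is a straight swap of "P" and "N" is inferred from the description above, and the toy values are made up.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the example above.
toy = pd.DataFrame({
    "age": ["41", "23", "60"],
    "sex": ["F", "M", "F"],
    "binaryClass": ["P", "N", "P"],
})

# Stand-in for the relabelling step: swap the two labels with a mapping.
swapped = toy.assign(binaryClass=toy["binaryClass"].map({"P": "N", "N": "P"}))

# Stand-in for the dtype step: numeric columns via to_numeric,
# categorical columns via astype("category").
typed = swapped.copy()
for col in ["age"]:
    typed[col] = pd.to_numeric(typed[col])
for col in ["sex", "binaryClass"]:
    typed[col] = typed[col].astype("category")

print(typed.dtypes)
```

Using an explicit mapping keeps the swap symmetric, so applying it twice returns the original labels.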

### EDA Graph Creation
Now that the data is clean, we can use the package's plotting tools. First, we will use `plot_correlations` to see the correlations between the numeric columns.

In [None]:
grp.plot_correlations(hyper_clean).show()

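To inspect the numbers behind a correlation plot, a pairwise correlation matrix can be computed directly with pandas. The sketch below uses a made-up three-column frame. Whether `plot_correlations` relies on `DataFrame.corr` internally is an assumption, not something this page confirms.

```python
import pandas as pd

# Made-up numeric data: FTI rises linearly with TT4, TSH falls linearly.
toy = pd.DataFrame({
    "TT4": [100.0, 120.0, 140.0, 160.0],
    "FTI": [50.0, 60.0, 70.0, 80.0],
    "TSH": [4.0, 3.0, 2.0, 1.0],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr()

print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.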
### Model Graph Creation
The package's other plotting function, `visualize_classification`, needs model predictions, so we first train a classifier.

In [None]:
# Imports for splitting, preprocessing, and modelling
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Splitting data
X = hyper_clean.drop(columns="binaryClass")
y = hyper_clean['binaryClass']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Preprocessing data: scale numeric columns, one-hot encode categorical ones
onehot = ['sex', 'on thyroxine', 'query on thyroxine', 'on antithyroid medication',
          'sick', 'pregnant', 'thyroid surgery', 'I131 treatment', 'query hypothyroid',
          'query hyperthyroid', 'lithium', 'goitre', 'tumor', 'psych', 'hypopituitary']
numeric = ['age', 'TSH', 'TT4', 'T4U', 'FTI']
ct = make_column_transformer(
    (StandardScaler(), numeric),
    (OneHotEncoder(handle_unknown='ignore'), onehot)
)

# Cross-validating a LogisticRegression pipeline on the training set
pipe_log = make_pipeline(ct, LogisticRegression(max_iter=1000, C=1))
cv = cross_validate(pipe_log, X_train, y_train, error_score='raise', return_train_score=True)

# Fitting the classifier on the transformed training data and predicting
lr = LogisticRegression(max_iter=1000, C=1)
X_train_trans = ct.fit_transform(X_train)
X_test_trans = ct.transform(X_test)
lr.fit(X_train_trans, y_train)
train_preds = lr.predict(X_train_trans)

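The dictionary returned by `cross_validate` is computed above but never displayed. One possible way to summarize it is to average each entry across folds, sketched here on synthetic data from `make_classification`. The names `X_toy`, `y_toy`, `pipe`, `scores`, and `summary` are illustrative only, and the synthetic data stands in for the thyroid features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption: a stand-in for real features).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaling and logistic regression in a single pipeline.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cross_validate returns a dict of per-fold arrays
# (fit_time, score_time, test_score, train_score).
scores = cross_validate(pipe, X_toy, y_toy, cv=5, return_train_score=True)

# Average each entry across the five folds.
summary = pd.DataFrame(scores).mean()

print(summary[["train_score", "test_score"]])
```

Comparing the mean train and validation scores gives a quick check for overfitting before any plots are drawn.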

In [None]:
print("Visualization of Classification between 2 numerical variables")
grp.visualize_classification(X_train, train_preds).show()