# Classifying doctor violations by hand, then by machine

You're on the lookout for a certain type of doctor violation: keeping poor records, being addicted to drugs, or anything else. **You decide.**

**You're going to see how often doctors lose their license for that violation.** There are about 7000 records, though, and you ain't going to read all of them!

Steps:

1. **Classify some violations by hand**
1. Vectorize the **hand-classified violations**
1. Train a classifier on the **hand-classified violations**.
1. **Test the classifier**. If it's good, next step! If not, go back to training.
1. Vectorize the **unclassified violations**
1. Use the classifier to **predict the labels of the unclassified violations**
1. What actions were taken against those doctors?

It'll be magic!

In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.read_csv("physicians-ny-violations.csv")
df.head(2)


Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth
0,Revocation of certificate of incorporation.,09/29/2010,09/29/2010,P.C.,563 Grand Medical,196275,,,The corporation admitted guilt to the charge o...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,
1,Revocation of certificate of incorporation. P...,12/01/2010,12/08/2010,P.C.,AR Medical Art,207165,,,The corporation admitted to the charge of havi...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,


In [3]:
df.shape


(7140, 13)

## Step 1: Classify some by hand

If you had a CSV with some sort of key in common, you'd be able to just do a join. But we don't! So **I'm going to help you out**.

I wrote this little script to help you **classify content by hand**. It will print the violation, then ask whether it's what you're looking for. If you type "y" or "Y" before hitting enter, that means YES; anything else counts as NO. Once it's done it'll add the results to the dataframe in a column called `category`.

In [6]:
number_to_classify_by_hand = 100


In [7]:
def is_what_you_want(row):
    response = input("\n------------\n\n{desc}\n\n\nIS THIS WHAT YOU'RE LOOKING FOR? y for YES ".format(desc=row.misconduct))
    if response == "y" or response == "Y":
        print("\n** Classified as YES **")
        return "YES"
    else:
        print("\n** Classified as NO **")
        return "NO"

# Reset category column
df['category'] = np.nan
df['category'] = df[:number_to_classify_by_hand].apply(is_what_you_want, axis=1)

df.category.value_counts()



------------

The corporation admitted guilt to the charge of ordering excessive tests, treatment, or use of treatment facilities not warranted by the condition of a patient.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The corporation admitted to the charge of having been convicted in New York Supreme Court, Kings County of a scheme to defraud in the first degree; falsifying business records; insurance fraud and failing to comply with the requirements of the New York State Business Corporation Law Section 1503(a).


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

This action modifies the penalty previously imposed  by Order# 93-40 on March 31, 1993, where the Hearing Committee sustained the charge that the physician was disciplined by the Utah State Medical Board, and ordered that if he intends to engage in practice in NY State, a two-year period of probation shall be imposed.


IS THIS WHAT YOU'RE LOOKING F


** Classified as YES **

------------

The Corporation was rendered in violation of New York State Business Corporation Law Section 1503(a) and (b) and 1504(a) due to the surrender of the sole shareholder's medical license.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES y

** Classified as YES **

------------

The Hearing Committee sustained the charge finding the physician guilty of having been disciplined by the Illinois State Department of Professional Regulation for filing insurance claims for services which were not rendered to patients.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The physician assistant did not contest the charge of fraudulent practice due to prescribing controlled substances for her own use.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

This order is a modification of the terms previously imposed on June 7, 2013 and does not constitute a new disciplinary action.  Previously he phys


** Classified as NO **

------------

The physician admitted guilt to the charges of failing to comply with a State law governing the practice of medicine and practicing beyond the scope permitted by law.Previously the physician's New York State medical license was summarily suspended on February 25, 2005.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES y

** Classified as YES **

------------

The physician admitted guilt to the charges of negligence on more than one occasion and failing to use scientifically accepted barrier precautions and infection control practices.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The Hearing Committee sustained the charge finding the physician guilty of having been convicted in United States District Court,Southern District of New York of illegally distributing and dispensing Schedule III and IV controlled substances.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The physi


** Classified as NO **

------------

The physician did not contest the charge of gross negligence


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The physician admitted to having been disciplined by the Rhode Island State Board of Licensing and Discipline for having sexual contact with a psychiatric patient.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The physician admitted she could not successfully defend against the charge of having been disciplined by the Florida State Board of Medical Examiners for failing to practice medicine with an acceptable level of care.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

This action is not disciplinary in nature.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The physician did not contest the charges of having been convicted in Superior Court of New Jersey, Monmouth County of theft by dec

NO     88
YES    12
Name: category, dtype: int64

In [24]:
df.head()


Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth,category,category_as_0_and_1
0,Revocation of certificate of incorporation.,09/29/2010,09/29/2010,P.C.,563 Grand Medical,196275,,,The corporation admitted guilt to the charge o...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,NO,0.0
1,Revocation of certificate of incorporation. P...,12/01/2010,12/08/2010,P.C.,AR Medical Art,207165,,,The corporation admitted to the charge of havi...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,NO,0.0
2,License Surrender,,01/13/1999,Joseph,Aaron,72800,MD,,This action modifies the penalty previously im...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1927.0,NO,0.0
3,License limited until the physician's North Ca...,12/06/2005,12/13/2005,Mark,Aarons,161530,MD,Gold,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1958.0,NO,0.0
4,License surrender.,08/07/2013,08/14/2013,Jamsheed,Abadi,136045,MD,S,The physician did not contest the charge of fa...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1939.0,NO,0.0


## Step 2: Vectorize the violation descriptions

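You want to **ONLY DO THIS WITH THE ONES YOU CLASSIFIED.** (The cell below fits the vectorizer on every description, then keeps just the first 100 rows of the result, the ones classified by hand.)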

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


In [26]:
vec = CountVectorizer(stop_words = 'english', max_features=3000)

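# regex=True so "\d" is treated as "any digit" (newer pandas defaults to a literal match)
matrix = vec.fit_transform(df['misconduct'].fillna('').str.replace(r"\d", "", regex=True))
# get_feature_names_out() is the scikit-learn >= 1.0 name for get_feature_names()
features_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())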

# Just the ones I classified:
features_df[:100]


Unnamed: 0,abandoment,abandoning,abandonment,abdominal,abetting,abide,abilities,ability,ablations,able,...,wyoming,xanax,xenical,yacht,yag,year,years,yolo,yonkers,york
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
6,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
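The cell above learns its vocabulary from every description, including the ones that were never labeled. If you'd rather have the vectorizer see only the hand-classified rows, something like the following sketch would do it. The toy sentences and the `labeled` / `unlabeled` names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

labeled = [
    "physician admitted guilt to negligence",
    "physician convicted of insurance fraud",
]
unlabeled = [
    "physician convicted of fraud in Utah",
]

vec = CountVectorizer(stop_words="english")
# fit_transform learns the vocabulary from the labeled rows only
X_labeled = vec.fit_transform(labeled)
# transform reuses that vocabulary; words it never saw ("utah") are dropped
X_unlabeled = vec.transform(unlabeled)

print(sorted(vec.vocabulary_))
print(X_labeled.shape, X_unlabeled.shape)
```

Both matrices end up with the same number of columns, because the vocabulary was fixed by the labeled rows.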


## Step 3: Create a classifier and train a model using the violation descriptions

You want to **ONLY DO THIS WITH THE ONES YOU CLASSIFIED.** You'll also need to make the `category` column a number, probably.

And remember your test/train split!

In [9]:
# df_50 = df.head(50)


In [27]:
df["category_as_0_and_1"] = df.category[:100].apply(lambda x: int(x == "YES"))
df


Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth,category,category_as_0_and_1
0,Revocation of certificate of incorporation.,09/29/2010,09/29/2010,P.C.,563 Grand Medical,196275,,,The corporation admitted guilt to the charge o...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,NO,0.0
1,Revocation of certificate of incorporation. P...,12/01/2010,12/08/2010,P.C.,AR Medical Art,207165,,,The corporation admitted to the charge of havi...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,NO,0.0
2,License Surrender,,01/13/1999,Joseph,Aaron,72800,MD,,This action modifies the penalty previously im...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1927.0,NO,0.0
3,License limited until the physician's North Ca...,12/06/2005,12/13/2005,Mark,Aarons,161530,MD,Gold,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1958.0,NO,0.0
4,License surrender.,08/07/2013,08/14/2013,Jamsheed,Abadi,136045,MD,S,The physician did not contest the charge of fa...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1939.0,NO,0.0
5,The New York State Board of Regents restored t...,04/15/2004,04/13/2004,Abdul,Abbasi,183025,MD,Hafeez,"Previously the Review Board on November 14,199...",https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1955.0,YES,1.0
6,"License surrender, $5,000 fine",06/21/2001,04/29/1996,Samih,Abbassi,171180,MD,R,The physician did not contest the charges of n...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,NO,0.0
7,The physician is subject to a license limitati...,12/08/2014,12/15/2014,Aiman,Abboud,232699,DO,Michael,The physician asserted he could not successful...,https://apps.health.ny.gov/pubdoh/professional...,The physician is subject to a license limitati...,https://apps.health.ny.gov/pubdoh/professional...,1962.0,NO,0.0
8,License surrender,09/26/2011,10/03/2011,Naglaa,Abdel-Al,227440,MD,Zidan Elsayed,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1969.0,NO,0.0
9,No action taken against the physician's licens...,06/21/2001,04/11/2001,Mohammad,Abdel-Hameed,173309,MD,Fathi Ahmad,The Hearing Committee sustained the charge fin...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1938.0,NO,0.0


In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features_df[:100].values,
    df["category_as_0_and_1"][:100],
    test_size = 0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(80, 3000) (20, 3000) (80,) (20,)
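With only 12 YES labels out of 100, a purely random split can hand the test set a different number of positives every time you rerun the cell. A hedged sketch of one option: `train_test_split` accepts `stratify` to keep the class balance similar in both halves, and `random_state` to make the split repeatable. The arrays below are dummy data standing in for `features_df[:100].values` and the 0/1 labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins: 100 rows, 88 NO (0) and 12 YES (1), as in the counts above
X = np.zeros((100, 5))
y = np.array([0] * 88 + [1] * 12)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,        # keep roughly 12% YES in both train and test
    random_state=42,   # arbitrary seed, only so reruns give the same split
)
print(X_train.shape, X_test.shape)
print(y_train.sum(), y_test.sum())
```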


In [13]:
# from sklearn.linear_model import LinearRegression

# lr = LinearRegression(fit_intercept = 0)
# lr.fit(X_train, y_train)


In [29]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

clf = BernoulliNB()
clf.fit(X_train, y_train)


BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

## Step 4: Test the classifier

How does it look? Remember, we're only using the classified ones so far!

**If you don't like its predicting ability**, go back up and play around with your vectorizer, and even with your classifier. There are a lot of options!

In [15]:
# from sklearn.metrics import mean_squared_error, r2_score

# print('R2 on training data:', r2_score(y_train, lr.predict(X_train)), '\nMean Squared Error on training data:', mean_squared_error(y_train, lr.predict(X_train)))

# print('R2 on test data:', r2_score(y_test, lr.predict(X_test)), '\nMean Squared Error on test data:', mean_squared_error(y_test, lr.predict(X_test)))


In [30]:
print(clf.score(X_test, y_test))
print(clf.predict(X_test))
print(y_test)


0.8
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]
46    0.0
53    0.0
45    0.0
60    0.0
11    0.0
0     0.0
47    0.0
79    0.0
50    0.0
52    1.0
40    0.0
88    1.0
13    0.0
5     1.0
8     0.0
56    0.0
58    0.0
20    0.0
21    0.0
77    1.0
Name: category_as_0_and_1, dtype: float64
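An accuracy of 0.8 sounds decent, but the printout above shows the classifier predicted 0 for all 20 test rows while 4 of them were really 1. Accuracy alone hides that. As a sketch, here are the same 20 true labels and predictions, copied from the output above, run through scikit-learn's `confusion_matrix` and `recall_score`:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# True labels and predictions copied from the test output above
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 20

# Share of rows guessed right
print(accuracy_score(y_true, y_pred))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# Recall for class 1: of the real YES rows, how many were found?
print(recall_score(y_true, y_pred))
```

All 4 real YES rows land in the "predicted NO" column, so recall for YES is 0. On this test set the model behaves exactly like always answering NO, which is worth knowing before trusting its predictions on the other 7040 rows.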


## Step 5: Vectorize the unclassified violations

Now we need to vectorize the violations we didn't classify by hand.

You **DO NOT MAKE A NEW VECTORIZER**. You just use the one we already have! Also, you **DON'T FIT IT AGAIN!** You just transform. I hope you read this line, but I'll give you some code anyway.

In [17]:
# features_df[50:]


In [31]:
not_categorized = df[df["category_as_0_and_1"].isnull()]
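# regex=True so "\d" is treated as "any digit" (newer pandas defaults to a literal match)
matrix_not_categorized = vec.transform(not_categorized['misconduct'].fillna('').str.replace(r"\d", "", regex=True))
# get_feature_names_out() is the scikit-learn >= 1.0 name for get_feature_names()
features_df_not_categorized = pd.DataFrame(matrix_not_categorized.toarray(), columns=vec.get_feature_names_out())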
features_df_not_categorized.head()


Unnamed: 0,abandoment,abandoning,abandonment,abdominal,abetting,abide,abilities,ability,ablations,able,...,wyoming,xanax,xenical,yacht,yag,year,years,yolo,yonkers,york
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
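Why not fit again? A classifier learns one weight per column, so the columns at prediction time have to be the same words in the same order as at training time. A small sketch with made-up sentences shows how a second vectorizer ends up with a different set of columns:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_text = ["negligence and fraud", "poor records"]
new_text = ["fraud in Wyoming", "records of xanax prescriptions"]

vec = CountVectorizer()
vec.fit(train_text)

# Right: reuse the fitted vectorizer, so columns match the training matrix
reused = vec.transform(new_text)

# Wrong: a fresh vectorizer learns its own, different vocabulary
fresh = CountVectorizer().fit(new_text)

print(sorted(vec.vocabulary_))
print(sorted(fresh.vocabulary_))
print(reused.shape)
```

The two vocabularies don't line up, so a matrix built from `fresh` would mean something different in every column than what the classifier was trained on.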


## Step 6: Use the classifier to predict the labels of the unclassified violations

You **DON'T NEED A NEW CLASSIFIER**, use the one you have! You'll use `clf.predict`, and feed it... what? What does it need to predict the labels?

In [32]:
clf.predict(features_df_not_categorized)


array([ 0.,  0.,  0., ...,  0.,  0.,  0.])
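`clf.predict` only returns the winning class. When every answer comes back 0, it can help to look at how sure the model is: `predict_proba` gives a probability per class, which lets you rank rows or pick your own cutoff. A sketch on a tiny made-up feature matrix (the 0.3 cutoff is arbitrary, not a recommendation):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up word-presence features: column 0 ~ "records", column 1 ~ "fraud"
X_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [0, 0]])
y_train = np.array([1, 1, 0, 0, 0, 0])

clf = BernoulliNB()
clf.fit(X_train, y_train)

X_new = np.array([[1, 0], [0, 1], [1, 1]])
# One row per sample, one column per class (in the order of clf.classes_)
proba = clf.predict_proba(X_new)
p_yes = proba[:, list(clf.classes_).index(1)]

print(clf.predict(X_new))
print(p_yes.round(2))
# Flag anything at or above the arbitrary cutoff as YES
print((p_yes >= 0.3).astype(int))
```

Sorting the unclassified rows by `p_yes` and reading the top few by hand is one cheap way to find more YES examples to label.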

### Step 6.2: Those labels are ugly

If you used a `LabelEncoder` to create your categories, you can feed the numbers to `le.inverse_transform` to get actual text back.

In [33]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["No", "Yes"])


LabelEncoder()

In [34]:
le.transform(["No", "Yes"])
    

array([0, 1])

In [35]:
le.fit_transform(["No", "Yes"])


array([0, 1])
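`LabelEncoder` sorts its classes and numbers them from 0, which is why "No" becomes 0 and "Yes" becomes 1 above. Going the other way, `inverse_transform` turns numbers back into text. One thing to watch, shown in this sketch: the encoder here was fit on "No"/"Yes", while the hand-classified rows use "NO"/"YES", and an encoder only knows the exact strings it was fit on.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["No", "Yes"])

# Position in this array = encoded number
print(le.classes_)
# Numbers back to text
print(le.inverse_transform([0, 1, 0]))

# Labels the encoder never saw raise a ValueError
try:
    le.transform(["YES"])
except ValueError as err:
    print("unknown label:", err)
```

If you want predicted labels to match the hand-typed ones, fitting the encoder on `["NO", "YES"]` instead would keep the spelling consistent.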

### Step 6.3: Put the category labels back into the original dataframe

In [58]:
df["category"][100:] = clf.predict(features_df_not_categorized)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [63]:
df["category"][100:].value_counts()

0.0    7040
Name: category, dtype: int64
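The "copy of a slice" warning a couple of cells up comes from chained indexing: `df["category"][100:]` first pulls out a column, then slices it, and pandas can't promise the assignment reaches the original dataframe. A sketch of the usual alternative, a single `.loc` call that names rows and column together, on a toy dataframe (the data and the `n_labeled` name are made up, and it assumes the default 0, 1, 2, ... index):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"category": ["NO", "YES", None, None, None]})
n_labeled = 2                       # rows already classified by hand
predictions = np.array([0, 0, 1])   # stand-in for clf.predict(...)

# One indexing step: rows from label n_labeled onward, column "category"
toy.loc[n_labeled:, "category"] = predictions
print(toy)
```

With the real data the equivalent would be along the lines of `df.loc[100:, "category"] = clf.predict(features_df_not_categorized)`.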

In [74]:
df['category'][100:] = le.inverse_transform(df["category"][100:].astype(int))
# df['category']
df['label'] = df['category']
df


Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth,category,category_as_0_and_1,label
0,Revocation of certificate of incorporation.,09/29/2010,09/29/2010,P.C.,563 Grand Medical,196275,,,The corporation admitted guilt to the charge o...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,NO,0.0,NO
1,Revocation of certificate of incorporation. P...,12/01/2010,12/08/2010,P.C.,AR Medical Art,207165,,,The corporation admitted to the charge of havi...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,NO,0.0,NO
2,License Surrender,,01/13/1999,Joseph,Aaron,72800,MD,,This action modifies the penalty previously im...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1927.0,NO,0.0,NO
3,License limited until the physician's North Ca...,12/06/2005,12/13/2005,Mark,Aarons,161530,MD,Gold,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1958.0,NO,0.0,NO
4,License surrender.,08/07/2013,08/14/2013,Jamsheed,Abadi,136045,MD,S,The physician did not contest the charge of fa...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1939.0,NO,0.0,NO
5,The New York State Board of Regents restored t...,04/15/2004,04/13/2004,Abdul,Abbasi,183025,MD,Hafeez,"Previously the Review Board on November 14,199...",https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1955.0,YES,1.0,YES
6,"License surrender, $5,000 fine",06/21/2001,04/29/1996,Samih,Abbassi,171180,MD,R,The physician did not contest the charges of n...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,NO,0.0,NO
7,The physician is subject to a license limitati...,12/08/2014,12/15/2014,Aiman,Abboud,232699,DO,Michael,The physician asserted he could not successful...,https://apps.health.ny.gov/pubdoh/professional...,The physician is subject to a license limitati...,https://apps.health.ny.gov/pubdoh/professional...,1962.0,NO,0.0,NO
8,License surrender,09/26/2011,10/03/2011,Naglaa,Abdel-Al,227440,MD,Zidan Elsayed,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1969.0,NO,0.0,NO
9,No action taken against the physician's licens...,06/21/2001,04/11/2001,Mohammad,Abdel-Hameed,173309,MD,Fathi Ahmad,The Hearing Committee sustained the charge fin...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1938.0,NO,0.0,NO


In [46]:
predictions = clf.predict(features_df_not_categorized)
predictions = predictions.astype(int)
le.inverse_transform(predictions)


array(['No', 'No', 'No', ..., 'No', 'No', 'No'], 
      dtype='<U3')

## Step 7: What actions were taken against those doctors?

In [47]:
predictions_column = pd.Series(le.inverse_transform(predictions))
predictions_column.value_counts()


No    7040
dtype: int64
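Once the labels are in the dataframe, the question from the top (how often do doctors lose their license for that violation?) comes down to comparing the `action` text across labels. A rough sketch on made-up rows; treating any action that mentions "surrender" or "revocation" as losing the license is an assumption for illustration, and `lost_license` is a hypothetical column name:

```python
import pandas as pd

toy = pd.DataFrame({
    "action": [
        "License surrender",
        "Revocation of certificate of incorporation.",
        "Censure and reprimand",
        "License surrender, $5,000 fine",
        "Probation for two years",
    ],
    "label": ["YES", "NO", "NO", "YES", "YES"],
})

# Assumption: these two words signal that the license was lost
toy["lost_license"] = toy["action"].str.contains(
    "surrender|revocation", case=False, regex=True
)

# Share of rows in each label group where the license was lost
rates = toy.groupby("label")["lost_license"].mean()
print(rates)
```

Of course, with every unclassified row predicted as "No", as in the counts above, there is no YES group to compare yet. The classifier needs fixing first.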

In [None]:
# number_to_classify_by_hand = 2  #             No action?!
# number_to_classify_by_hand = 50 # No    7090, No action?!
# number_to_classify_by_hand = 2  # No    7040, No action?!
