## Dataset Used: Medical Speech, Transcription, and Intent 
#### (Source: Kaggle https://www.kaggle.com/paultimothymooney/medical-speech-transcription-and-intent)

#### Goal: from a phrase describing how a user feels, predict the ailment they might be suffering from (the `prompt` column). The text is pre-processed with bag-of-words counts and TF-IDF weighting, and classified with a Multinomial Naive Bayes model.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [2]:
df = pd.read_csv("overview-of-recordings.csv")

In [3]:
df.head()

Unnamed: 0,audio_clipping,audio_clipping:confidence,background_noise_audible,background_noise_audible:confidence,overall_quality_of_the_audio,quiet_speaker,quiet_speaker:confidence,speaker_id,file_download,file_name,phrase,prompt,writer_id
0,no_clipping,1.0,light_noise,1.0,3.33,audible_speaker,1.0,43453425,https://ml.sandbox.cf3.us/cgi-bin/index.cgi?do...,1249120_43453425_58166571.wav,When I remember her I feel down,Emotional pain,21665495
1,light_clipping,0.6803,no_noise,0.6803,3.33,audible_speaker,1.0,43719934,https://ml.sandbox.cf3.us/cgi-bin/index.cgi?do...,1249120_43719934_43347848.wav,When I carry heavy things I feel like breaking...,Hair falling out,44088126
2,no_clipping,1.0,no_noise,0.6655,3.33,audible_speaker,1.0,43719934,https://ml.sandbox.cf3.us/cgi-bin/index.cgi?do...,1249120_43719934_53187202.wav,there is too much pain when i move my arm,Heart hurts,44292353
3,no_clipping,1.0,light_noise,1.0,3.33,audible_speaker,1.0,31349958,https://ml.sandbox.cf3.us/cgi-bin/index.cgi?do...,1249120_31349958_55816195.wav,My son had his lip pierced and it is swollen a...,Infected wound,43755034
4,no_clipping,1.0,no_noise,1.0,4.67,audible_speaker,1.0,43719934,https://ml.sandbox.cf3.us/cgi-bin/index.cgi?do...,1249120_43719934_82524191.wav,My muscles in my lower back are aching,Infected wound,21665495


In [4]:
df_final = df[['file_name','phrase','prompt','overall_quality_of_the_audio','speaker_id']]

In [5]:
df_final.head()

Unnamed: 0,file_name,phrase,prompt,overall_quality_of_the_audio,speaker_id
0,1249120_43453425_58166571.wav,When I remember her I feel down,Emotional pain,3.33,43453425
1,1249120_43719934_43347848.wav,When I carry heavy things I feel like breaking...,Hair falling out,3.33,43719934
2,1249120_43719934_53187202.wav,there is too much pain when i move my arm,Heart hurts,3.33,43719934
3,1249120_31349958_55816195.wav,My son had his lip pierced and it is swollen a...,Infected wound,3.33,31349958
4,1249120_43719934_82524191.wav,My muscles in my lower back are aching,Infected wound,4.67,43719934


In [6]:
df_final = df_final.drop(columns=["file_name", "overall_quality_of_the_audio", "speaker_id"])

In [7]:
df_final.head()

Unnamed: 0,phrase,prompt
0,When I remember her I feel down,Emotional pain
1,When I carry heavy things I feel like breaking...,Hair falling out
2,there is too much pain when i move my arm,Heart hurts
3,My son had his lip pierced and it is swollen a...,Infected wound
4,My muscles in my lower back are aching,Infected wound


In [8]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df_final["phrase"].to_list())
X_train_counts.shape

(6661, 1147)
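
To make the shape above easier to read: each row is one phrase and each column is one vocabulary word. The sketch below runs the same `CountVectorizer` on two made-up phrases (they are assumptions for illustration, not rows from the dataset). Note that the default tokenizer lowercases text and drops one-character tokens such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy phrases, not taken from the dataset
toy_phrases = [
    "my knee hurts when I walk",
    "my head hurts",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_phrases)

# Column order follows the alphabetically sorted vocabulary
vocab = sorted(toy_vect.vocabulary_)
dense = toy_counts.toarray()

print(toy_counts.shape)  # (2, 6): 2 phrases, 6 distinct words
print(vocab)
print(dense)
```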

In [9]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(6661, 1147)
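
The shape is unchanged because `TfidfTransformer` only re-weights the counts. A minimal sketch on made-up phrases (an assumption for illustration) shows two effects of the default settings: every row is scaled to unit L2 norm, and a word that appears in every phrase ("my") gets a lower weight than a rarer word in the same phrase ("head").

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy phrases, not taken from the dataset
toy_phrases = [
    "my knee hurts when I walk",
    "my head hurts",
    "my head is spinning",
]

vect = CountVectorizer()
counts = vect.fit_transform(toy_phrases)
tfidf = TfidfTransformer().fit_transform(counts)

dense_tfidf = tfidf.toarray()

# With the default norm="l2", each row has length 1
row_norms = np.linalg.norm(dense_tfidf, axis=1)

# Compare the weights of a common and a rarer word in the second phrase
my_col = vect.vocabulary_["my"]
head_col = vect.vocabulary_["head"]
print(row_norms)
print(dense_tfidf[1, my_col], dense_tfidf[1, head_col])
```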

In [10]:
# Fit Multinomial Naive Bayes on the TF-IDF features; labels come from the "prompt" column
clf = MultinomialNB().fit(X_train_tfidf, df_final["prompt"].to_list())

In [12]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [13]:
text_clf = text_clf.fit(df_final["phrase"].to_list(), df_final["prompt"].to_list())

In [14]:
# Accuracy on the same data the pipeline was fitted on (training accuracy)
text_clf.score(df_final["phrase"].to_list(), df_final["prompt"].to_list())

0.9863383876294851
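
The score above is measured on the phrases the pipeline was fitted on, so it says little about unseen text. One way to check generalization is to hold out part of the data. The sketch below is not part of the original notebook: the helper name `holdout_accuracy`, the split size, and the seed are assumptions, and the toy data only exists so the snippet runs without the CSV.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def holdout_accuracy(phrases, labels, test_size=0.2, seed=0):
    """Fit the three-step pipeline on a training split and score it on the held-out rest."""
    X_train, X_test, y_train, y_test = train_test_split(
        phrases, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    model = Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", MultinomialNB()),
    ])
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Hypothetical toy data (two classes, six phrases each), not from the dataset
toy_phrases = [
    "my knee joints ache", "pain in my joints", "my joints are stiff",
    "my elbow joint hurts", "joints hurt when I walk", "sore joints in my hands",
    "my head is pounding", "I have a bad headache", "my head hurts a lot",
    "throbbing pain in my head", "headache behind my eyes", "my head feels heavy",
]
toy_labels = ["Joint pain"] * 6 + ["Head ache"] * 6

acc = holdout_accuracy(toy_phrases, toy_labels, test_size=0.25)
print(acc)
```

On the real data the call would be `holdout_accuracy(df_final["phrase"].to_list(), df_final["prompt"].to_list())`. If the same phrase text occurs in more than one row, a random split can still place copies on both sides, so the held-out score may remain optimistic.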

In [15]:
test_data = ["Hello doctor, I am having pain in my joints."]

In [16]:
text_clf.predict(test_data)

array(['Joint pain'], dtype='<U18')
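
`predict` returns only the single most likely label. Because the final step is `MultinomialNB`, the pipeline also exposes `predict_proba`, which can be used to list several candidate ailments with their estimated probabilities. The helper `top_k_prompts` and the toy training data below are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def top_k_prompts(model, phrase, k=3):
    """Return the k most probable labels for one phrase as (label, probability) pairs."""
    probs = model.predict_proba([phrase])[0]
    order = np.argsort(probs)[::-1][:k]  # indices of the largest probabilities first
    return [(str(model.classes_[i]), float(probs[i])) for i in order]


# Hypothetical toy data, not from the dataset
toy_phrases = [
    "my joints ache when I climb stairs", "pain in my knee joints",
    "my head is pounding", "I have a throbbing headache",
    "my skin is itchy and red", "there is a rash on my arm skin",
]
toy_labels = [
    "Joint pain", "Joint pain",
    "Head ache", "Head ache",
    "Skin issue", "Skin issue",
]

toy_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
]).fit(toy_phrases, toy_labels)

query = "Hello doctor, I am having pain in my joints."
ranked = top_k_prompts(toy_clf, query, k=2)
print(ranked)
```

With the fitted notebook pipeline, the equivalent call would be `top_k_prompts(text_clf, test_data[0])`.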